# These Are Your Tweets on LDA (Part I)

How can we get a sense of what someone tweets about? One way would be to identify themes, or topics, that tend to occur in a user’s tweets. Perhaps we can look through the user’s profile, continually scrolling down and getting a feel for the different topics that they tweet about.

But what if we could use machine learning to discover topics automatically, to measure how much each topic occurs, and even tell us the words that make up the topic?

In this post, we’ll do just that. We’ll retrieve users’ tweets, and use an unsupervised machine learning technique called Latent Dirichlet Allocation (LDA) to uncover topics within the tweets. Then we’ll create visualizations for the topics based on the words that define them. Our tools will be Java, Twitter4J, and Mallet. All of the code is available on GitHub for reference.

As a sneak preview, here’s a visualization of a topic from @gvanrossum:

First I’ll give an intuitive background of LDA, then explain some of the underlying math, and finally move to the code and applications.

What’s LDA?

Intuitively, Latent Dirichlet Allocation provides a thematic summary of a set of documents (in our case, a set of tweets). It gives this summary by discovering ‘topics’, and telling us the proportion of each topic found in a document.

To do so, LDA attempts to model how a document was ‘generated’ by assuming that a document is a mixture of different topics, and assuming that each word is ‘generated’ by one of the topics.

As a simple example, consider the following tweets:

(1) Fruits and vegetables are healthy.

(2) I like apples, oranges, and avocados. I don’t like the flu or colds.

Let’s remove stop words, giving:

(1) fruits vegetables healthy

(2) apples oranges avocados flu colds

We’ll let $k$ denote the number of topics that we think these tweets are generated from. Let’s say there are $k = 2$ topics. Note that there are $V = 8$ unique words in our corpus. LDA would tell us that:

Topic 1 = Fruits, Vegetables, Apples, Oranges, Avocados
Topic 2 = Healthy, Flu, Colds

And that:

Tweet 1 = (2/3) Topic 1, (1/3) Topic 2
Tweet 2 = (3/5) Topic 1, (2/5) Topic 2

We can conclude that there’s a food topic and a health topic, see words that define those topics, and view the topic composition of each tweet.

Each topic in LDA is a probability distribution over the words. In our case, LDA would give $k = 2$ distributions of size $V = 8$. Each item of the distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions $\beta_{1}.$ It might look something like:

$\beta_{1} = [0.4, 0.2, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05]$

$\beta_{1}$ lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?

Now, I’ll jump into the math underlying LDA to explore specifically what LDA does and how it works. If you still need some more intuition-building, see Edwin Chen’s great blog post. Or feel free to skip to the application if you’d like, but I’d encourage you to read on!

A Bit More Formal

LDA assumes that documents (assumed to be bags of words) are generated by a mixture of topics (distributions over words). We define the following variables and notation:

$k$ is the number of topics.

$V$ is the number of unique words in the vocabulary.

$\theta$ is the topic distribution (of length $k$) for a document, drawn from a uniform Dirichlet distribution with parameter $\alpha$.

$z_{n}$ is a topic 'assignment' for word $w_{n}$, sampled from $p(z_{n} = i|\theta) = \theta_{i}$.

$\textbf{w} = (w_{1}, ... , w_{N})$ is a document with N words.

$w_{n}^{i} = 1$ means that the word $w_{n}$ is the $i$th word of the vocabulary.

$\beta$ is a $k \times V$ matrix, where each row $\beta_{i}$ is the multinomial distribution for the $i$th topic. That is, $\beta_{ij} = p(w_{n}^{j} = 1 | z_{n} = i)$.



LDA then posits that a document is generated according to the following process:

1. Fix $k$ and $N$.

2. Sample a topic distribution $\theta$ from $Dir(\alpha_{1}, ... , \alpha_{k})$.

$\theta$ defines the topic mixture of the document, so intuitively $\theta_{i}$ is the degree to which topic $i$ appears in the document.

3. For each word index $n \in \left\{1,..., N\right\}$:

4. Draw $z_{n}$ from $\theta$.

$z_{n} = i$ tells us that the word we are about to generate will be generated by topic $i$.

5. Draw a word $w_{n}$ from $\beta_{z_{n}}$.

In other words, we choose the row of $\beta$ based on our value of $z_{n}$ from (4), then sample from the distribution that this row defines. Going back to our example, if we drew the “Food” row in step (4), then it’s more likely that we’ll generate “Fruits” than “Flu” in step (5).

We can see that this process does in fact generate a document based on the topic mixture $\theta$, the topic-word assignments $\textbf{z}$, and the probability matrix $\beta$.
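To make the generative story concrete, here is a toy simulation of the process in plain Java, using the fruits/health vocabulary from the earlier example. The $\theta$ and $\beta$ values are invented for illustration, and the Dirichlet draw of step (2) is replaced by a fixed $\theta$:

```java
import java.util.Random;

public class ToyLdaGenerator {
    // Vocabulary from the fruits/health example (ordering assumed for illustration).
    public static final String[] VOCAB = {
        "fruits", "vegetables", "healthy", "apples",
        "oranges", "avocados", "flu", "colds"
    };

    // beta: one row per topic, each row a distribution over VOCAB.
    // These probabilities are invented for this sketch, not learned values.
    public static final double[][] BETA = {
        {0.25, 0.20, 0.00, 0.20, 0.20, 0.15, 0.00, 0.00}, // topic 0: "food"
        {0.00, 0.00, 0.40, 0.00, 0.00, 0.00, 0.30, 0.30}  // topic 1: "health"
    };

    // Draw an index from a discrete probability distribution.
    public static int sample(double[] dist, Random rng) {
        double u = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < dist.length; i++) {
            cum += dist[i];
            if (u < cum) return i;
        }
        return dist.length - 1;
    }

    // For each word slot: draw a topic z from theta (step 4),
    // then draw a word from the chosen row of beta (step 5).
    public static String[] generate(double[] theta, int n, Random rng) {
        String[] doc = new String[n];
        for (int i = 0; i < n; i++) {
            int z = sample(theta, rng);           // topic assignment for this word
            doc[i] = VOCAB[sample(BETA[z], rng)]; // word drawn from that topic
        }
        return doc;
    }

    public static void main(String[] args) {
        // A fixed theta stands in for the Dirichlet draw of step (2).
        double[] theta = {2.0 / 3, 1.0 / 3};
        System.out.println(String.join(" ", generate(theta, 5, new Random())));
    }
}
```

Running main() prints a short nonsense “tweet” whose words skew toward the food topic, since $\theta$ puts two-thirds of its weight there.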

However, we observe the document, and must infer the latent topic mixture and topic-word assignments. Hence LDA aims to infer:

$p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta)$.

Coupling Problems

The story’s not over quite yet, though. We have:

$p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) = \frac{p(\theta, \textbf{z}, \textbf{w} | \alpha, \beta)}{p(\textbf{w} | \alpha, \beta)}$

Let’s consider the denominator. I’m going to skip the derivation here (see pg. 5 of Reed for the full story), but we have:

$p(\textbf{w} | \alpha, \beta) = \frac{\Gamma(\sum_{i=1}^{k}\alpha_{i})}{\prod_{i=1}^{k}\Gamma(\alpha_{i})} \int (\prod_{i=1}^{k}\theta_{i}^{\alpha_{i}-1})(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_{i}\beta_{ij})^{w_{n}^{j}})\,d\theta$

We cannot separate the $\theta$ and $\beta$, so computing this term is intractable; we must find another approach to infer the hidden variables.

A Simpler Problem

A workaround is variational inference: rather than computing the true posterior directly, we approximate it with a simpler family of distributions that yields an optimizable lower bound on the log likelihood. We drop the problematic coupling between $\theta$ and $\beta$ and posit a factorized distribution with free variational parameters $\gamma$ and $\phi$:

$q(\theta, \textbf{z}|\gamma, \phi) = q(\theta|\gamma)\prod_{n=1}^{N}q(z_{n}|\phi_{n})$

And minimize the KL-Divergence between this distribution and the actual distribution $p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta)$, resulting in the problem:

$(\gamma^{*}, \phi^{*}) = \operatorname{argmin}_{\gamma, \phi} D_{KL}(q(\theta, \textbf{z}|\gamma, \phi)\,||\,p(\theta, \textbf{z}|\textbf{w}, \alpha, \beta))$
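As a concrete aside, for discrete distributions the KL-divergence has the simple closed form $D_{KL}(q\,||\,p) = \sum_{i} q_{i}\log(q_{i}/p_{i})$. A small self-contained Java helper makes the quantity tangible (purely illustrative: the actual variational optimization works on a bound and never materializes $p$ explicitly, which is the whole point):

```java
public class Divergence {
    // D_KL(q || p) for discrete distributions: sum_i q_i * log(q_i / p_i).
    // Assumes p_i > 0 wherever q_i > 0, otherwise the divergence is infinite.
    public static double klDivergence(double[] q, double[] p) {
        double kl = 0.0;
        for (int i = 0; i < q.length; i++) {
            if (q[i] > 0.0) kl += q[i] * Math.log(q[i] / p[i]);
        }
        return kl;
    }

    public static void main(String[] args) {
        double[] q = {0.5, 0.5};
        double[] p = {0.9, 0.1};
        // Non-negative, and zero exactly when the distributions match.
        System.out.println(klDivergence(q, p));
        System.out.println(klDivergence(q, q));
    }
}
```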

Since we also do not know $\beta$ and $\alpha$, we use Expectation Maximization (EM) to alternate between estimating $\beta$ and $\alpha$ using our current estimates of $\gamma$ and $\phi$, and estimating $\gamma$ and $\phi$ using our current estimates of $\beta$ and $\alpha$.

More specifically, in the E-step, we solve for $(\gamma^{*}, \phi^{*})$, and in the M-step, we perform the updates:

$\beta_{ij} \propto\sum_{d=1}^{M}\sum_{n=1}^{N_{d}}\phi^{*}_{dni}w^{j}_{dn}$

and

$\log(\alpha^{t+1}) = \log(\alpha^{t}) - \frac{\frac{dL}{d\alpha}}{\frac{d^{2}L}{d\alpha^{2}}\alpha + \frac{dL}{d\alpha}}$

I’ve glossed over this part a bit, but the takeaway is that we must compute a lower bound of the actual distribution, and we use EM to do so since we have two sets of unknown parameters. And in the end, we end up with estimates of $\theta, \textbf{z}, \beta$ as desired.

For more in-depth coverage, see Reed’s LDA Tutorial or the original LDA paper.

Now, let’s move to the code.

The Plan

We’ll use @BarackObama as the running example. First we’ll download @BarackObama’s tweets, which will be our corpus, with each tweet representing a ‘document’.

Then, we’ll run LDA on the corpus in order to discover 50 topics and the top 20 words associated with each topic. Next, we will infer the topic distribution over the entire set of tweets. Hence we’ll be able to see topics and the degree to which they appear in @BarackObama’s tweets.

Finally, we’ll visualize the top 20 words for a given topic based on the relative frequency of the words.

I’ll walk through the important parts of the code, but I’ll skip the details in the interest of brevity. For the whole picture, check out the code on GitHub; running the main() method of TopicModel.java will run the entire process and produce similar results to those shown below.

Getting the Tweets

To retrieve the tweets, we’ll rely on the Twitter4J library, an unofficial Java library for the Twitter API. The code found in TwitterClient.java is a wrapper that helps out with the things we need. The method that does the ‘work’ is

public List<String> getUserTweetsText(String username, int n)

which retrieves the last n of a user’s tweets and returns them as a List of Strings. In this method we fetch one page (200 tweets) at a time, so the main loop runs roughly n/200 iterations.

The highest-level method is

public void downloadTweetsFromUser(String username, int numTweets)

which calls getUserTweetsText() and saves the output to files. For organization’s sake, it saves the user’s tweets to

• ./data/{username}/{username}_tweets.txt, which contains one tweet per line
• ./data/{username}/{username}_tweets_single.txt, which contains all of the tweets on a single line. This second file will be used later to infer the topic distribution over the user’s entire set of tweets.
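To make the file layout concrete, here’s a minimal sketch of what the saving step might look like (a hypothetical helper, not the actual code; the real implementation lives in TwitterClient.java):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class TweetSaver {
    // Writes the two files described above under baseDir/{username}/.
    public static void saveTweets(Path baseDir, String username, List<String> tweets) {
        try {
            Path dir = baseDir.resolve(username);
            Files.createDirectories(dir);
            // One tweet per line: each tweet will be its own 'document'.
            Files.write(dir.resolve(username + "_tweets.txt"), tweets);
            // All tweets on a single line: treated later as one big document.
            Files.write(dir.resolve(username + "_tweets_single.txt"),
                        List.of(String.join(" ", tweets)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience for demonstration: save into a temp directory and return it.
    public static Path demo(String username, List<String> tweets) {
        try {
            Path tmp = Files.createTempDirectory("lda-demo");
            saveTweets(tmp, username, tweets);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a saved file back (unchecked wrapper around Files.readAllLines).
    public static List<String> readLines(Path p) {
        try {
            return Files.readAllLines(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The per-line file feeds per-tweet documents into Mallet, while the single-line file lets us treat the whole timeline as one document when inferring the overall topic distribution.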

Hence we can download 3000 of @BarackObama’s tweets like so:

TwitterClient tc = new TwitterClient();
tc.downloadTweetsFromUser("BarackObama", 3000);

Hammering Out Some Topics

Now it’s time to take a Mallet to the tweets in order to mine some topics. Mallet is a powerful library for text-based machine learning; we can use its topic modeling through its Java API to load and clean the tweet data, train an LDA model, and output results. In order to use the Mallet API, you’ll have to follow the download instructions and build a jar file, or get it from this post’s GitHub repo.

The Mallet-related code that I’ll discuss next is found in TopicModel.java.

The first step is to prepare and load the data. Our goal is to get the data in the {username}_tweets.txt file into an InstanceList object, i.e. a form that can be used by Mallet models.

To do so, we first create a series of “Pipes” to feed the data through. The idea is that each Pipe performs some transformation of the data and feeds it to the next Pipe. In our case, in

static ArrayList<Pipe> makePipeList()

we create a series of Pipes that will lowercase and tokenize the tweets, remove stop words, and convert the tweets to a sequence of features.
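For intuition, the effect of this pipe chain on a single tweet can be approximated in plain Java. This is an illustrative stand-in, not Mallet’s implementation; Mallet’s tokenizer pattern and default stop list are far more thorough:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetPreprocessor {
    // Tiny illustrative stop list; Mallet ships a much larger default one.
    private static final Set<String> STOP_WORDS =
        Set.of("the", "and", "are", "a", "i", "or", "to");

    // Keeps letters, digits, hashtags, and mentions as tokens.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}#@][\\p{L}\\p{N}_#@]*");

    // Mirrors the pipe chain: lowercase -> tokenize -> remove stop words.
    public static List<String> preprocess(String tweet) {
        Matcher m = TOKEN.matcher(tweet.toLowerCase());
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            if (!STOP_WORDS.contains(m.group())) tokens.add(m.group());
        }
        return tokens;
    }
}
```

Feeding tweet (1) from the intuition section through this yields [fruits, vegetables, healthy], matching the stop-word-stripped version shown earlier.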

Then, in

static InstanceList fileToInstanceList(String filename)

we iterate through the input file, and use our Pipe list to modify and prepare the data, returning the InstanceList that we set out to build.

Training Time

It’s training time. Using

public static ParallelTopicModel trainModel(InstanceList instances,
int numTopics, int numIters,
double alphaT, double betaW)

we train an LDA model called ParallelTopicModel on the InstanceList data. The betaW parameter is a uniform prior for $\beta$, and the alphaT parameter is the sum of the $\alpha$ parameters; recall from the math section that $\beta$ is the $k \times V$ matrix that gives word probabilities given a topic, and $\alpha$ is the parameter for the Dirichlet distribution over topics.

Looking at the Output, Part I

With a trained model, we can now look at the words that make up the topics using $\beta$, and the composition $\theta_i$ for $tweet_i$.

Printed to stdout, we see the 50 topics, with the top 20 words for each topic. The number attached to each word is the number of times that word was assigned to the topic:

Topic 0
families:13 republican:11 read:7 fast:5 ed:5 op:5 family:5 marine:4 democrat:4 efforts:4 policies:4 story:4 pendleton:3 camp:3 explains:3 military:3 #cir:3 california:3 #familiessucceed:3 workplace:3
Topic 1
president:175 obama:165 middle:61 class:59 jobs:47 #abetterbargain:37 economy:37 good:32 growing:20 #opportunityforall:20 isn:19 #rebuildamerica:18 today:18 create:17 watch:17 infrastructure:16 americans:16 plan:15 live:15 american:15

Topic 48
president:72 address:62 obama:57 watch:56 weekly:49 opportunity:15 economic:15 discusses:14 issue:11 importance:10 working:10 week:10 speak:9 budget:8 discuss:8 congress:7 calls:6 building:6 #opportunityforall:6 lady:5
Topic 49
discrimination:19 lgbt:17 rights:15 #enda:15 americans:14 law:10 act:10 today:10 thedreamisnow:7 screening:7 voting:7 protections:7 basic:7 stand:7 add:7 american:7 anniversary:6 workplace:6 workers:6 support:6

Each topic’s listing corresponds to taking a row of $\beta$, finding the indices of the 20 highest probabilities, and choosing the words that correspond to these indices.
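That lookup can be sketched in a few lines (a hypothetical helper; the actual printing is done inside Mallet):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TopWords {
    // Given one row of beta (a distribution over the vocabulary),
    // return the topN most probable words.
    public static List<String> topWords(double[] betaRow, String[] vocab, int topN) {
        return IntStream.range(0, betaRow.length)
            .boxed()
            // Sort word indices by probability, highest first.
            .sorted((a, b) -> Double.compare(betaRow[b], betaRow[a]))
            .limit(topN)
            .map(i -> vocab[i])
            .collect(Collectors.toList());
    }
}
```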

Since each tweet is a document, the model also contains the topic distribution for each tweet. However, our goal is to get a sense of the overall topic distribution for all of the user’s tweets, which will require an additional step. In other words, we’d like to see a summary of the major topics that the user tends to tweet about.

Getting An Overall Picture

To do so, we will use our trained model to infer the distribution over all of the user’s tweets. We create a single Instance containing all of the tweets using

singleLineFile = {username}_tweets_single.txt

and find the topic distribution:

double[] dist = inferTopicDistribution(model,
TopicModel.fileToInstanceList(singleLineFile));

We can then look at the distribution in ./data/{username}/{username}_composition.txt:

0 0.006840685214902167
1 0.048881207831654686
...
29 0.09993216645340489
...
48 0.022192334924649955
49 0.01473112438501846

We see, for instance, that topic 29 is more prominent than topic 0; specifically, the model inferred that more words were generated by topic 29 than topic 0.
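Ranking topic indices by these inferred proportions is a small utility; a sketch of the idea (hypothetical code, not the repo’s actual implementation):

```java
import java.util.stream.IntStream;

public class TopicRanker {
    // Rank topic indices by their inferred proportion, descending,
    // keeping only the topN most prominent topics.
    public static int[] rankTopics(double[] dist, int topN) {
        return IntStream.range(0, dist.length)
            .boxed()
            .sorted((a, b) -> Double.compare(dist[b], dist[a]))
            .limit(topN)
            .mapToInt(Integer::intValue)
            .toArray();
    }
}
```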

In ./data/{username}/{username}_ranked.txt we have the top 10 topics, ranked by composition, along with each topic’s top words. For instance, at the top of the file is:

Topic 29:
obama:474
president:474
america:77
...
action:23
making:21
job:21

This topic could probably be labeled as “presidential”; a topic we’d expect to find near the top for @BarackObama.

Looking on, we see a topic that is clearly about healthcare:

Topic 24
health:103
insurance:94
americans:56
#obamacare:55
...
uninsured:21
covered:21

and one about climate change:

Topic 28
change:125
climate:90
#actonclimate:59
#climate:39
...
time:19
#sciencesaysso:14
act:14
call:13
science:13

The inferred topics are pretty amazing; a job well done by LDA. But while viewing words and their frequencies may be fun, let’s visualize a topic in a nicer way.

Into the Clouds

We now have the words that make up the most prominent topics, along with the frequency of each word. A natural visualization for a topic is a word cloud, which allows us to easily see the words and their relative weights.

It turns out that a word-cloud generator named Wordle can create word clouds given a list of weighted words…exactly the format found in ./data/{username}/{username}_ranked.txt !

Let’s copy the word:frequency list for Topic 29 and throw it into Wordle (dialing down obama and president to 200):

and the results:

Topic 29 – “Presidential”

Topic 24 – “Healthcare”

Topic 28 – “Climate Change”

A Brief Pause

Let’s pause for a second. This is a nice moment and a beautiful result.

Take a look at the words: LDA inferred a relationship between “country”, “america”, and “obama”. It grouped “insurance”, “#obamacare”, and “health”. It discovered a link between “climate”, “deniers”, and “#sciencesaysso”.

Glance back up at the math section. We never told it about presidents, countries, or healthcare. Nowhere in there is there a hard-coded link between climate words and science hash tags. In fact, it didn’t even know it would be dealing with words or tweets.

It’s ‘just’ an optimization problem, but when applied it can discover complex relationships that have real meaning. This is an aspect that I personally find fascinating about LDA and, more generally, about machine learning.

As a next step, feel free to download the code and try out other usernames to get a sense of LDA’s generality; it’s not limited to @BarackObama, climate, or healthcare.

Next Steps: From One to Many

Currently, we have a nice way of viewing the content of one topic, in isolation. In the next post, we’ll develop a visualization using d3.js for all of the top topics at once. We’ll be able to see and compare topics side by side, and obtain a higher level view of the overall topic distribution. As a sneak preview, here’s a visualization of the top 10 topics from @nytimes: