These Are Your Tweets on LDA (Part II)

In the last post, I gave an overview of Latent Dirichlet Allocation (LDA), and walked through an application of LDA on @BarackObama’s tweets. The final product was a set of word clouds, one per topic, that showed the weighted words that defined the topic.

In this post, we’ll develop a dynamic visualization that incorporates multiple topics, allowing us to gain a high-level view of the topics and also drill down to see the words that define each topic. Through a simple web interface, we’ll also be able to view data from different twitter users.

See http://ldaviz.herokuapp.com/ for an example of the finished product.

As before, all of the code is available on GitHub. The visualization-related code is found in the viz/static directory.

Harnessing the Data

In the last post, we downloaded tweets for a user and found 50 topics that occur in the user’s tweets along with the top 20 words for each topic. We also found the composition of topics across all of the tweets, allowing us to rank the topics by prominence. For our visualization, we’ll choose to display the 10 highest ranked topics for a given twitter user name.

We need a visualization that can show multiple groupings of data. Each of the 10 groupings has 20 words, so we’d also like one that avoids information overload. Finally, we’d like to incorporate the frequencies that we have for each word.

d3

A good fit for these requirements is d3.js’s Zoomable Pack Layout, which gives us a high-level view of each grouping as a bubble. Upon clicking a bubble, we can see the data that comprises the bubble, as well as each data point’s relative weight:

d3 Zoomable Pack Layout

In our case, each top-level bubble is a topic, and each inner bubble is a word, with its relative size determined by the word’s frequency.

Since the d3 visualization takes JSON as input, plugging in our LDA output is as simple as creating a toJSON() method in TopicModel.java that writes the data associated with the top 10 topics to a JSON file. The ‘name’ of each topic is simply the most probable word in the topic.

Now, when the LDA process (the main() method in TopicModel.java) is run for a twitter user, the code will create a corresponding JSON file in viz/json. The JSON structure:

{
 "name":"",
 "children":[
    {
     "name": {topic_1_name},
     "children":[
       {
        "name": {topic_1_word_1},
        "size": {topic_1_word_1_freq}
       },
       {
        "name": {topic_1_word_2},
        "size": {topic_1_word_2_freq}
       },
       {
        "name": {topic_1_word_3},
        "size": {topic_1_word_3_freq}
       },
....
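In plain Java, this nested structure can be produced with simple string building; here is a minimal sketch (hypothetical, since the actual toJSON() in TopicModel.java may be implemented differently):

```java
import java.util.List;

public class TopicJson {
    // Build the nested {name, children:[{name, children:[{name,size}...]}...]}
    // structure expected by the d3 pack layout.
    static String toJson(List<String> topicNames,
                         List<List<String>> topicWords,
                         List<List<Integer>> wordFreqs) {
        StringBuilder sb = new StringBuilder("{\"name\":\"\",\"children\":[");
        for (int t = 0; t < topicNames.size(); t++) {
            if (t > 0) sb.append(',');
            sb.append("{\"name\":\"").append(topicNames.get(t)).append("\",\"children\":[");
            List<String> words = topicWords.get(t);
            for (int w = 0; w < words.size(); w++) {
                if (w > 0) sb.append(',');
                sb.append("{\"name\":\"").append(words.get(w))
                  .append("\",\"size\":").append(wordFreqs.get(t).get(w)).append('}');
            }
            sb.append("]}");
        }
        return sb.append("]}").toString();
    }
}
```

A real implementation would also escape quotes inside words; tweets rarely need it after tokenization, so it is omitted here for brevity.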

Javascripting

Now, we make slight modifications to the JavaScript code embedded in the given d3 visualization. Our goal is to be able to toggle between results for different twitter users; we’d like to switch from investigating the @nytimes topics to getting a sense of what @KingJames tweets about.

To do so, we add a drop-down to index.html, such that each time a user is selected on the drop-down, their corresponding JSON is loaded by the show() function in viz.js. Hence we also change the show() function to reload the visualization each time it is called.

Making The Visualizations Visible

To run the code locally, navigate to the viz/static directory and start an HTTP server to serve the content (on Python 3, use python -m http.server instead of SimpleHTTPServer), e.g.

cd {project_root}/viz/static
python -m SimpleHTTPServer

then navigate to http://localhost:8000/index.html to see the visualization.

By selecting nytimes, we see the following visualization which gives a sense of the topics:

@nytimes topics

Upon clicking the ‘gaza’ topic, we see the top words that comprise the topic:

'gaza' topic

I’ve also used Heroku to put an example of the finished visualization with data from 10 different twitter usernames here:

http://ldaviz.herokuapp.com/

Have fun exploring the various topics!


These Are Your Tweets on LDA (Part I)

How can we get a sense of what someone tweets about? One way would be to identify themes, or topics, that tend to occur in a user’s tweets. Perhaps we can look through the user’s profile, continually scrolling down and getting a feel for the different topics that they tweet about.

But what if we could use machine learning to discover topics automatically, to measure how much each topic occurs, and even tell us the words that make up the topic?

In this post, we’ll do just that. We’ll retrieve users’ tweets, and use an unsupervised machine learning technique called Latent Dirichlet Allocation (LDA) to uncover topics within the tweets. Then we’ll create visualizations for the topics based on the words that define them. Our tools will be Java, Twitter4J, and Mallet. All of the code is available on GitHub for reference.

As a sneak preview, here’s a visualization of a topic from @gvanrossum:

Topic visualization for @gvanrossum

First I’ll give an intuitive background of LDA, then explain some of the underlying math, and finally move to the code and applications.

What’s LDA?

Intuitively, Latent Dirichlet Allocation provides a thematic summary of a set of documents (in our case, a set of tweets). It gives this summary by discovering ‘topics’, and telling us the proportion of each topic found in a document.

To do so, LDA attempts to model how a document was ‘generated’ by assuming that a document is a mixture of different topics, and assuming that each word is ‘generated’ by one of the topics.

As a simple example, consider the following tweets:

(1) Fruits and vegetables are healthy.

(2) I like apples, oranges, and avocados. I don’t like the flu or colds.

Let’s remove stop words, giving:

(1) fruits vegetables healthy

(2) apples oranges avocados flu colds

We’ll let k denote the number of topics that we think these tweets are generated from. Let’s say there are k = 2 topics. Note that there are V = 8 words in our corpus. LDA would tell us that:

Topic 1 = Fruits, Vegetables, Apples, Oranges, Avocados
Topic 2 = Healthy, Flu, Colds

And that:

Tweet 1 = (2/3) Topic 1, (1/3) Topic 2
Tweet 2 = (3/5) Topic 1, (2/5) Topic 2

We can conclude that there’s a food topic and a health topic, see words that define those topics, and view the topic composition of each tweet.
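The per-tweet proportions above are just normalized counts of word-topic assignments, which we can verify with a few lines (a toy illustration of the bookkeeping, not LDA’s actual inference):

```java
public class TopicMixture {
    // Given the topic assignment of each word in a tweet, return the
    // fraction of words assigned to each of k topics.
    static double[] proportions(int[] assignments, int k) {
        double[] p = new double[k];
        for (int z : assignments) p[z] += 1.0 / assignments.length;
        return p;
    }
}
```

For tweet 1 (“fruits vegetables healthy”), assigning the first two words to Topic 1 and the last to Topic 2 recovers the (2/3, 1/3) mixture.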

Each topic in LDA is a probability distribution over the words. In our case, LDA would give k = 2 distributions of size V = 8. Each item of the distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions \beta_{1}. It might look something like:

\beta_{1} = [0.4, 0.2, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05]

\beta_{1} lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?

Now, I’ll jump into the math underlying LDA to explore specifically what LDA does and how it works. If you still need some more intuition-building, see Edwin Chen’s great blog post. Or feel free to skip to the application if you’d like, but I’d encourage you to read on!

A Bit More Formal

LDA assumes that documents (assumed to be bags of words) are generated by a mixture of topics (distributions over words). We define the following variables and notation:

k is the number of topics.

V is the number of unique words in the vocabulary.

\theta  is the topic distribution (of length k ) for a document, drawn from a Dirichlet distribution with parameter \alpha .

z_{n} is a topic 'assignment' for word w_{n}, sampled from p(z_{n} = i|\theta) = \theta_{i}.

\textbf{w} = (w_{1}, ... , w_{N}) is a document with N words.

w_{n}^{i} = 1 means that the word w_{n} is the i'th word of the vocabulary.

\beta  is a k \times V matrix, where each row \beta_{i}  is the multinomial distribution for the ith topic. That is, \beta_{ij} = p(w^{j} = 1 | z = i).

LDA then posits that a document is generated according to the following process:

1. Fix k and N.

2. Sample a topic distribution \theta from Dir(\alpha_{1}, ... , \alpha_{k}) .

\theta defines the topic mixture of the document, so intuitively \theta_{i} is the degree to which topic_{i} appears in the document.

3. For each word index n \in \left\{1,..., N\right\}:

4. Draw z_{n} from \theta.

z_{n} = i tells us that the word we are about to generate will be generated by topic i.

5. Draw a word w_{n} from \beta_{z_{n}}.

In other words, we choose the row of \beta based on our value of z_{n} from (4), then sample from the distribution that this row defines. Going back to our example, if we drew the “Food” row in step (4), then it’s more likely that we’ll generate “Fruits” than “Flu” in step (5).

We can see that this does in fact generate a document based on the topic mixture \theta , the topic-word assignments z, and the probability matrix \beta .
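The generative process above can be sketched in plain Java for the special case \alpha_{i} = 1, where sampling from the Dirichlet reduces to normalizing exponential draws; the \beta matrix passed in is a toy assumption:

```java
import java.util.Random;

public class LdaGenerator {
    static final Random rng = new Random(42);

    // Step 2: sample theta ~ Dir(1,...,1) by normalizing exponential draws
    // (for general alpha one would normalize Gamma(alpha_i) draws instead).
    static double[] sampleUniformDirichlet(int k) {
        double[] theta = new double[k];
        double sum = 0;
        for (int i = 0; i < k; i++) {
            theta[i] = -Math.log(rng.nextDouble());
            sum += theta[i];
        }
        for (int i = 0; i < k; i++) theta[i] /= sum;
        return theta;
    }

    // Steps 4-5 both need this: draw an index from a discrete distribution.
    static int sampleDiscrete(double[] dist) {
        double u = rng.nextDouble(), cum = 0;
        for (int i = 0; i < dist.length; i++) {
            cum += dist[i];
            if (u < cum) return i;
        }
        return dist.length - 1;
    }

    // Generate an N-word document from the k x V topic-word matrix beta,
    // returning word indices into the vocabulary.
    static int[] generateDocument(int N, double[][] beta) {
        double[] theta = sampleUniformDirichlet(beta.length); // step 2
        int[] doc = new int[N];
        for (int n = 0; n < N; n++) {
            int z = sampleDiscrete(theta);    // step 4: pick a topic
            doc[n] = sampleDiscrete(beta[z]); // step 5: pick a word from that topic
        }
        return doc;
    }
}
```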

However, we observe the document, and must infer the latent topic mixture and topic-word assignments. Hence LDA aims to infer:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) .

Coupling Problems

The story’s not over quite yet, though. We have:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) = \frac{p(\theta, \textbf{z}, \textbf{w} | \alpha, \beta)}{p(\textbf{w} | \alpha, \beta)}

Let’s consider the denominator. I’m going to skip the derivation here (see pg. 5 of Reed for the full story), but we have:

p(\textbf{w} | \alpha, \beta) = \frac{\Gamma(\sum_{i=1}^{k}\alpha_{i})}{\prod_{i=1}^{k}\Gamma(\alpha_{i})} \int (\prod_{i=1}^{k}\theta_{i}^{\alpha_{i}-1})(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_{i}\beta_{ij})^{w_{n}^{j}})\,d\theta

We cannot separate the \theta and \beta, so computing this term is intractable; we must find another approach to infer the hidden variables.

A Simpler Problem

A workaround is variational inference: rather than computing the intractable posterior directly, we choose a simpler, factorized family of distributions and find the member of that family that best approximates it, which amounts to optimizing a lower bound on the log-likelihood. We simplify the problem to:

q(\theta, \textbf{z}|\gamma, \phi) = q(\theta|\gamma)\prod_{n=1}^{N}q(z_{n}|\phi_{n})

And minimize the KL-Divergence between this distribution and the actual distribution p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) , resulting in the problem:

(\gamma^{*}, \phi^{*}) = argmin_{\gamma, \phi} D_{KL}(q(\theta, z|\gamma, \phi)||p(\theta, z|w, \alpha, \beta))
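For intuition about the objective, the KL divergence between two discrete distributions is D_{KL}(q||p) = \sum_{i} q_{i} log(q_{i}/p_{i}). The variational objective above involves expectations over \theta and z, but the discrete case shows the quantity being minimized:

```java
public class Kl {
    // D_KL(q || p) for discrete distributions, in nats.
    // Assumes p_i > 0 wherever q_i > 0 (0 * log 0 is treated as 0).
    static double klDivergence(double[] q, double[] p) {
        double d = 0;
        for (int i = 0; i < q.length; i++)
            if (q[i] > 0) d += q[i] * Math.log(q[i] / p[i]);
        return d;
    }
}
```

Note that D_KL is zero exactly when the two distributions match, which is why minimizing it drives q toward the true posterior.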

Since we also do not know \beta and \alpha, we use Expectation Maximization (EM) to alternate between estimating \beta and \alpha using our current estimates of \gamma and \phi, and estimating \gamma and \phi using our current estimates of \beta and \alpha.

More specifically, in the E-step, we solve for (\gamma^{*}, \phi^{*}), and in the M-step, we perform the updates:

\beta_{ij} \propto\sum_{d=1}^{M}\sum_{n=1}^{N_{d}}\phi^{*}_{dni}w^{j}_{dn}

and

log(\alpha^{t+1}) = log(\alpha^{t}) - \frac{\frac{dL}{d\alpha}}{\frac{d^{2}L}{d\alpha^{2}}\alpha+\frac{dL}{d\alpha}}

I’ve glossed over this part a bit, but the takeaway is that we must compute a lower bound of the actual distribution, and we use EM to do so since we have two sets of unknown parameters. And in the end, we end up with estimates of \theta, \textbf{z}, \beta as desired.

For more in depth coverage, see Reed’s LDA Tutorial or the original LDA paper.

Now, let’s move to the code.

The Plan

We’ll use @BarackObama as the running example. First we’ll download @BarackObama’s tweets, which will be our corpus, with each tweet representing a ‘document’.

Then, we’ll run LDA on the corpus in order to discover 50 topics and the top 20 words associated with each topic. Next, we will infer the topic distribution over the entire set of tweets. Hence we’ll be able to see topics and the degree to which they appear in @BarackObama’s tweets.

Finally, we’ll visualize the top 20 words for a given topic based on the relative frequency of the words.

I’ll walk through the important parts of the code, but I’ll skip the details in the interest of brevity. For the whole picture, check out the code on GitHub; running the main() method of TopicModel.java will run the entire process and produce similar results to those shown below.

Getting the Tweets

To retrieve the tweets, we’ll rely on the Twitter4J library, an unofficial Java library for the Twitter API. The code found in TwitterClient.java is a wrapper that helps out with the things we need. The method that does the ‘work’ is

public List<String> getUserTweetsText(String username, int n)

which retrieves the last n of a user’s tweets and returns them as a List of Strings. In this method we access 1 page (200 tweets) at a time, so the main loop has n/200 iterations.
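The paging logic looks roughly like the following sketch, with a hypothetical PageFetcher standing in for Twitter4J’s timeline call (the real getUserTweetsText() uses Twitter4J’s Paging class and handles rate limits):

```java
import java.util.ArrayList;
import java.util.List;

public class TweetPager {
    static final int PAGE_SIZE = 200;

    // Hypothetical stand-in for Twitter4J's getUserTimeline(username, paging).
    interface PageFetcher {
        List<String> fetchPage(int page, int count);
    }

    // Retrieve the last n tweets, one page (200 tweets) at a time.
    static List<String> getUserTweetsText(PageFetcher fetcher, int n) {
        List<String> tweets = new ArrayList<>();
        int pages = (n + PAGE_SIZE - 1) / PAGE_SIZE; // ceil(n / 200) iterations
        for (int page = 1; page <= pages && tweets.size() < n; page++) {
            List<String> batch = fetcher.fetchPage(page, PAGE_SIZE);
            if (batch.isEmpty()) break; // the user has no more tweets
            for (String t : batch) {
                if (tweets.size() < n) tweets.add(t);
            }
        }
        return tweets;
    }
}
```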

The highest-level method is

public void downloadTweetsFromUser(String username, int numTweets)

which calls getUserTweetsText() and saves the output to files. For organization’s sake, it saves the user’s tweets to

  • ./data/{username}/{username}_tweets.txt, which contains one tweet per line
  • ./data/{username}/{username}_tweets_single.txt, which contains all of the tweets on a single line. This second file will be used later to infer the topic distribution over the user’s entire set of tweets.

Hence we can download 3000 of @BarackObama’s tweets like so:

TwitterClient tc = new TwitterClient();
tc.downloadTweetsFromUser("BarackObama", 3000);

Hammering Out Some Topics

Now it’s time to take a Mallet to the tweets in order to mine some topics. Mallet is a powerful library for text-based machine learning; we can use its topic modeling through its Java API to load and clean the tweet data, train an LDA model, and output results. In order to use the Mallet API, you’ll have to follow the download instructions and build a jar file, or get it from this post’s GitHub repo.

The Mallet-related code that I’ll discuss next is found in TopicModel.java.

Loading the Data

The first step is to prepare and load the data. Our goal is to get the data in the {username}_tweets.txt file into an InstanceList object, i.e. a form that can be used by Mallet models.

To do so, we first create a series of “Pipes” to feed the data through. The idea is that each Pipe performs some transformation of the data and feeds it to the next Pipe. In our case, in

static ArrayList<Pipe> makePipeList()

we create a series of Pipes that will lowercase and tokenize the tweets, remove stop words, and convert the tweets to a sequence of features.

Then, in

static InstanceList fileToInstanceList(String filename)

we iterate through the input file, and use our Pipe list to modify and prepare the data, returning the InstanceList that we set out to build.
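The Pipe idea, each stage transforming the data and handing it to the next, can be illustrated with plain Java function composition (a sketch only: Mallet’s actual Pipe and SerialPipes classes operate on Instance objects, and the stop-word list here is a toy):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipeSketch {
    static final Set<String> STOP = Set.of("the", "and", "are", "i");

    // Stage 1: lowercase and split on non-word characters.
    static final Function<String, List<String>> lowercaseAndTokenize =
        s -> Arrays.asList(s.toLowerCase().split("\\W+"));

    // Stage 2: drop stop words and any empty tokens left by the split.
    static final Function<List<String>, List<String>> removeStopwords =
        tokens -> tokens.stream()
                        .filter(t -> !t.isEmpty() && !STOP.contains(t))
                        .collect(Collectors.toList());

    // Compose the stages, as SerialPipes would.
    static List<String> process(String tweet) {
        return lowercaseAndTokenize.andThen(removeStopwords).apply(tweet);
    }
}
```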

Training Time

It’s training time. Using

public static ParallelTopicModel trainModel(InstanceList instances,
                                      int numTopics, int numIters,
                                      double alphaT, double betaW)

we train an LDA model called ParallelTopicModel on the InstanceList data. The betaW parameter is a uniform prior for \beta , and the alphaT parameter is the sum of the \alpha parameters (\sum_{i}\alpha_{i}); recall from the math section that \beta is the numtopics \times vocabsize matrix that gives word probabilities given a topic, and \alpha is the parameter for the distribution over topics.

Looking at the Output, Part I

With a trained model, we can now look at the words that make up the topics using \beta , and the composition \theta_i for tweet_i .

Printed to stdout, we see the 50 topics, with the top 20 words for each topic. The number attached to each word is the number of occurrences:

Topic 0 
families:13 republican:11 read:7 fast:5 ed:5 op:5 family:5 marine:4 democrat:4 efforts:4 policies:4 story:4 pendleton:3 camp:3 explains:3 military:3 #cir:3 california:3 #familiessucceed:3 workplace:3
Topic 1 
president:175 obama:165 middle:61 class:59 jobs:47 #abetterbargain:37 economy:37 good:32 growing:20 #opportunityforall:20 isn:19 #rebuildamerica:18 today:18 create:17 watch:17 infrastructure:16 americans:16 plan:15 live:15 american:15

Topic 48 
president:72 address:62 obama:57 watch:56 weekly:49 opportunity:15 economic:15 discusses:14 issue:11 importance:10 working:10 week:10 speak:9 budget:8 discuss:8 congress:7 calls:6 building:6 #opportunityforall:6 lady:5
Topic 49 
discrimination:19 lgbt:17 rights:15 #enda:15 americans:14 law:10 act:10 today:10 thedreamisnow:7 screening:7 voting:7 protections:7 basic:7 stand:7 add:7 american:7 anniversary:6 workplace:6 workers:6 support:6

Each box corresponds to taking a row of \beta , finding the indices with the 20 highest probabilities, and choosing the words that correspond to these indices.

Since each tweet is a document, the model also contains the topic distribution for each tweet. However, our goal is to get a sense of the overall topic distribution for all of the user’s tweets, which will require an additional step. In other words, we’d like to see a summary of the major topics that the user tends to tweet about.

Getting An Overall Picture

To do so, we will use our trained model to infer the distribution over all of the user’s tweets. We create a single Instance containing all of the tweets using

singleLineFile = {username}_tweets_single.txt

and find the topic distribution:

double[] dist = inferTopicDistribution(model, 
                    TopicModel.fileToInstanceList(singleLineFile));

We can then look at the distribution in  ./data/{username}/{username}_composition.txt:

0 0.006840685214902167
1 0.048881207831654686
...
29 0.09993216645340489
...
48 0.022192334924649955
49 0.01473112438501846

We see, for instance, that topic 29 is more prominent than topic 0; specifically, the model inferred that more words were generated by topic 29 than topic 0.
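Ranking topics by prominence is just a matter of sorting topic indices by their inferred proportions; a minimal sketch, independent of Mallet:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class TopicRanker {
    // Return the indices of the top-m topics, ordered by descending proportion.
    static List<Integer> topTopics(double[] dist, int m) {
        return IntStream.range(0, dist.length).boxed()
                .sorted(Comparator.comparingDouble((Integer i) -> -dist[i]))
                .limit(m)
                .collect(Collectors.toList());
    }
}
```

Applied to the composition file above, topic 29 would sort ahead of topics 1, 48, 49, and 0.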

In ./data/{username}/{username}_ranked.txt we have the top 10 topics, ranked by composition, along with each topic’s top words. For instance, at the top of the file is:

Topic 29:
obama:474
president:474
america:77
...
action:23
making:21
job:21

This topic could probably be labeled as “presidential”; a topic we’d expect to find near the top for @BarackObama.

Looking on, we see a topic that is clearly about healthcare:

Topic 24
health:103
insurance:94
americans:56
#obamacare:55
...
uninsured:21
covered:21

and one about climate change:

Topic 28
change:125
climate:90
#actonclimate:59
#climate:39
...
time:19
#sciencesaysso:14
act:14
call:13
science:13

The inferred topics are pretty amazing; a job well done by LDA. But while viewing words and their frequencies may be fun, let’s visualize a topic in a nicer way.

Into the Clouds

We now have the words that make up the most prominent topics, along with the frequency of each word. A natural visualization for a topic is a word cloud, which allows us to easily see the words and their relative weights.

It turns out that a word-cloud generator named Wordle can create word clouds given a list of weighted words…exactly the format found in ./data/{username}/{username}_ranked.txt !

Let’s copy the word:frequency list for Topic 29 and throw it into Wordle (dialing down obama and president to 200). The results:

Topic 29 – “Presidential”

Topic 24 – “Healthcare”

Topic 28 – “Climate Change”

A Brief Pause

Let’s pause for a second. This is a nice moment and a beautiful result.

Take a look at the words: LDA inferred a relationship between “country”, “america”, and “obama”. It grouped “insurance”, “#obamacare”, and “health”. It discovered a link between “climate”, “deniers”, and “#sciencesaysso”.

Glance back up at the math section. We never told it about presidents, countries, or healthcare. Nowhere in there is there a hard-coded link between climate words and science hash tags. In fact, it didn’t even know it would be dealing with words or tweets.

It’s ‘just’ an optimization problem, but when applied it can discover complex relationships that have real meaning. This is an aspect that I personally find fascinating about LDA and, more generally, about machine learning.

As a next step, feel free to download the code and try out other usernames to get a sense of LDA’s generality; it’s not limited to @BarackObama, climate, or healthcare.

Next Steps: From One to Many

Currently, we have a nice way of viewing the content of one topic, in isolation. In the next post, we’ll develop a visualization using d3.js for all of the top topics at once. We’ll be able to see and compare topics side by side, and obtain a higher level view of the overall topic distribution. As a sneak preview, here’s a visualization of the top 10 topics from @nytimes:

@nytimes top 10

Credits and Links

Much of the mathematical content is derived from Reed’s Tutorial and the LDA Paper (or the shorter version). These are also great resources for learning more. Edwin Chen’s blog post also provides a good introduction and an intuition-building example. The Mallet developer’s guide and data importing guide provide good examples of using the Mallet API. The Programming Historian has a great intro to using Mallet for LDA from the command line.