These Are Your Tweets on LDA (Part II)

In the last post, I gave an overview of Latent Dirichlet Allocation (LDA), and walked through an application of LDA on @BarackObama’s tweets. The final product was a set of word clouds, one per topic, that showed the weighted words that defined the topic.

In this post, we’ll develop a dynamic visualization that incorporates multiple topics, allowing us to gain a high-level view of the topics and also drill down to see the words that define each topic. Through a simple web interface, we’ll also be able to view data from different Twitter users.

Click here for an example of the finished product.

As before, all of the code is available on GitHub. The visualization-related code is found in the viz/static directory.

Harnessing the Data

In the last post, we downloaded tweets for a user and found 50 topics that occur in the user’s tweets, along with the top 20 words for each topic. We also found the composition of topics across all of the tweets, allowing us to rank the topics by prominence. For our visualization, we’ll display the 10 highest-ranked topics for a given Twitter user name.
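To make the ranking step concrete, here is a minimal sketch in JavaScript (chosen only to match the visualization code; it is not the project's actual LDA code). It assumes a hypothetical matrix `docTopics`, where `docTopics[i][k]` is the proportion of topic `k` in tweet `i`, and it assumes that "prominence" means the mean proportion across tweets, which may differ from the measure used in the last post.

```javascript
// Sketch only: rank topics by their mean proportion across all tweets.
// docTopics[i][k] = proportion of topic k in tweet i (hypothetical input shape).
function rankTopics(docTopics, topN) {
  if (docTopics.length === 0) return [];
  var numTopics = docTopics[0].length;
  var totals = new Array(numTopics).fill(0);

  // Sum each topic's proportion over every tweet.
  docTopics.forEach(function (row) {
    row.forEach(function (p, k) { totals[k] += p; });
  });

  // Convert sums to means, sort by descending weight, keep the top N.
  return totals
    .map(function (sum, k) {
      return { topic: k, weight: sum / docTopics.length };
    })
    .sort(function (a, b) { return b.weight - a.weight; })
    .slice(0, topN);
}
```

With 50 topics, `rankTopics(docTopics, 10)` would return the 10 topic indices that the visualization displays, most prominent first.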

We need a visualization that can show multiple groupings of data. Each of the 10 groupings has 20 words, so we’d also like one that avoids information overload. Finally, we’d like to incorporate the frequencies that we have for each word.


A good fit for these requirements is d3.js’s Zoomable Pack Layout, which gives us a high-level view of each grouping as a bubble. Upon clicking a bubble, we can see the data points that make up the bubble, as well as each data point’s relative weight:


d3 to the rescue

d3 Zoomable Pack Layout

In our case, each top-level bubble is a topic, and each inner bubble is a word, with its relative size determined by the word’s frequency.

Since the d3 visualization takes JSON as input, plugging in our LDA output is simple: we create a toJSON() method that writes the data associated with the top 10 topics to a JSON file. The ‘name’ of each topic is simply the most probable word in the topic.

Now, when the LDA process (the main() method) is run for a Twitter user, the code will create a corresponding JSON file in viz/json. The JSON structure:

    {
     "name": {topic_1_name},
     "children": [
       {
        "name": {topic_1_word_1},
        "size": {topic_1_word_1_freq}
       },
       {
        "name": {topic_1_word_2},
        "size": {topic_1_word_2_freq}
       },
       {
        "name": {topic_1_word_3},
        "size": {topic_1_word_3_freq}
       },
       ...
     ]
    }
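As a rough illustration of how such a structure could be assembled, the sketch below builds the nested object in JavaScript. It is not the project's toJSON() method. The function name `toHierarchy`, the input shape (`words` arrays of `{word, freq}` objects), and the use of the user name as the root label are all assumptions made for the example; the `name`/`children`/`size` keys follow the shape used by d3's pack layout examples.

```javascript
// Sketch only: build the nested object that the pack layout consumes.
// rankedTopics is assumed to look like
//   [{ words: [{ word: "gaza", freq: 0.12 }, ...] }, ...]
// with the most prominent topic first (hypothetical input shape).
function toHierarchy(userName, rankedTopics, maxTopics) {
  return {
    name: userName, // hypothetical root label
    children: rankedTopics.slice(0, maxTopics).map(function (topic) {
      // Copy before sorting so the caller's array is left untouched.
      var words = topic.words.slice().sort(function (a, b) {
        return b.freq - a.freq;
      });
      return {
        // A topic is named after its most probable word.
        name: words.length ? words[0].word : "",
        // Each word becomes a leaf whose bubble size is its frequency.
        children: words.map(function (w) {
          return { name: w.word, size: w.freq };
        })
      };
    })
  };
}
```

Calling `JSON.stringify(toHierarchy("nytimes", rankedTopics, 10), null, 1)` would then give text that can be written to a file and loaded by the visualization.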


Now, we make slight modifications to the JavaScript code embedded in the given d3 visualization. Our goal is to be able to toggle between results for different Twitter users; we’d like to switch from investigating the @nytimes topics to getting a sense of what @KingJames tweets about.

To do so, we add a drop-down to index.html, such that each time a user is selected on the drop-down, their corresponding JSON is loaded by the show() function in viz.js. Hence we also change the show() function to reload the visualization each time it is called.
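The reload logic described above might look roughly like the following sketch. It is not the actual contents of viz.js: the element ids `user-select` and `chart`, the `json/<user>.json` path, the helper `jsonPathFor`, and the `draw` callback are hypothetical, and the callback form of `d3.json` assumes d3 v3 is loaded on the page.

```javascript
// Hypothetical helper: map a drop-down value to that user's JSON file.
function jsonPathFor(user) {
  return "json/" + encodeURIComponent(user) + ".json";
}

// Sketch of a show() that reloads the visualization on every call.
// draw(root) stands in for the pack-layout rendering code.
function show(user, draw) {
  // Remove whatever the previous user's visualization left behind.
  d3.select("#chart").selectAll("*").remove();

  // Load the selected user's topics and render them.
  d3.json(jsonPathFor(user), function (error, root) {
    if (error) {
      console.error("Could not load topics for " + user, error);
      return;
    }
    draw(root);
  });
}

// Wire up the drop-down only when running in a browser.
if (typeof document !== "undefined") {
  var select = document.getElementById("user-select");
  if (select) {
    select.addEventListener("change", function () {
      show(select.value, function (root) {
        /* pack-layout drawing code goes here */
      });
    });
  }
}
```

Clearing the container before each load keeps bubbles from a previously selected user from lingering underneath the new ones.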

Making The Visualizations Visible

To run the code locally, navigate to the viz/static directory and start an HTTP server to serve the content, e.g.

cd {project_root}/viz/static
python -m SimpleHTTPServer   # Python 2; on Python 3 use: python3 -m http.server

then navigate to http://localhost:8000/index.html to see the visualization.

By selecting nytimes, we see the following visualization which gives a sense of the topics:

@nytimes topics

Upon clicking the ‘gaza’ topic, we see the top words that comprise the topic:

'gaza' topic

I’ve also used Heroku to put an example of the finished visualization with data from 10 different Twitter usernames here:

Have fun exploring the various topics!

