ECIR 2016 Paper and Presentation

I recently presented a paper entitled Efficient AUC Optimization for Information Ranking Applications at the European Conference for Information Retrieval (ECIR) in Padua, Italy.

In the paper, we derive gradient approximations for optimizing area under an ROC curve (AUC) and multi-class AUC using the LambdaMART algorithm. Here are slides for the talk, which give an overview of the paper and a brief review of Learning to Rank and LambdaMART:

The goal was to expand LambdaMART to optimize AUC in order to add to the portfolio of metrics that the algorithm is currently used for. The approach was to derive a “λ-gradient” analogous to those that have been defined for other metrics such as NDCG and Mean Average Precision.

This in turn required an efficient way of computing a key quantity, namely the change in the AUC from swapping two items in the ranked list. One contribution of the paper is a simple, efficient formula for computing this quantity (along with a proof), which appears in the paper as:


The paper also contains experimental results measuring the performance of LambdaMART using the derived gradients.

Peering into the Black Box : Visualizing LambdaMART

In the last post, I gave a broad overview of the Learning to Rank domain of machine learning that has applications in web search, machine translation, and question-answering systems. In this post, we’ll look at a state of the art model used in Learning to Rank called LambdaMART. We’ll take a look at some math underlying LambdaMART, then focus on developing ways to visualize the model.

LambdaMART produces a tree ensemble model, a class of models traditionally viewed as ‘black boxes’ since they take into account predictions of 10’s or 100’s of underlying trees. While viewing a single decision tree is intuitive and interpretable, viewing 100 at once would be overwhelming and uninformative:



Single Tree


Ensemble of Trees

 Ensemble of Trees

By moving from a single tree to an ensemble of trees, we tradeoff interpretability for performance.

In this post, we’ll take a look at the trees produced by the LambdaMART training process, and develop a visualization to view the ensemble of trees as a single collective unit. By doing so, we’ll gain back some of the intuitive interpretability that make tree models appealing. Through the visualization, we’ll peer into the black box model and gain some sense of the factors that the model uses to make predictions.

We use Java and the JForests library to train LambdaMART and parse its output, and d3.js for visualization. To get a sneak peek, the final visualization is here.

LambdaMART Overview

Let’s take a look at some details of LambdaMART. For the full story, check out this paper from Microsoft Research.

At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG. To understand LambdaMART we’ll look at two aspects: Lambda and MART.


LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART). Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.

Our goal is to find a function f(x) that minimizes the expected loss L:

\hat{f}(x) = \arg\min_{f(x)}E[L(y, f(x))|x]

for feature vectors x and labels y.

To do so we first view \hat{f}(x) as a sum of weak learners f_m:

\hat{f}(x)=\sum_{m=1}^M f_{m}(x)

Since we are dealing with decision trees, evaluating f_{m}(x) corresponds to passing x down the tree f_{m} until it reaches a leaf node l. The predicted value is then equal to a parameter \gamma_l.

Decision trees consist of two types of parameters: region assignments R_i that assign a training example x_i to a leaf node l, and leaf outputs \gamma_l that represent the tree’s output for all examples assigned to region R_l. Hence we can write each weak learner f_{m}(x) as a parameterized tree h_{m}(x;R_{m}, \gamma_{m}), giving:

\hat{f}(x)=\sum_{m=1}^M f_{m}(x)=\sum_{m=1}^M h_{m}(x; R_{m}, \gamma_{m})

Gradient Boosting finds each tree in a stepwise fashion. We start with an initial tree f_1, then step to another tree f_2, and so on:

\hat{f}_{1}(x) = f_{1}

\hat{f}_{2}(x) = f_{1} + f_{2}

\hat{f}_{M}(x) = \sum_{m=1}^{M}f_{m}

How do we determine the steps? At each step, we’d like the model to change such that the loss decreases as much as possible. The locally optimal decrease corresponds to the gradient of the loss with respect to the current model m:

g_{im}=\frac{\partial L[f(x_i), y_i]}{\partial f(x_i)}

Hence we define the step as f_m=-\rho_mg_m. This gives us insight into how Gradient Boosting solves the minimization – the algorithm is performing Gradient Descent in function space.

Now, we know that we want to take a gradient step, but still haven’t said how that translates to finding a tree. To do so, we build a tree that models the gradient; by adding such a tree to the ensemble, we effectively take a gradient step from the previous model. To fit the tree, we use squared error loss, giving the minimization problem:

argmin_{R, \gamma}\sum_{i=1}^N(-g_{im}-F(x_i;R, \gamma))^2

Where F(x_i;R, \gamma) denotes a regression tree with parameters R, \gamma.

Let’s take a step back. We’ve just set up a framework for finding an ensemble of trees that, when added together, minimizes a loss function. It’s a “framework” in the sense that we can use it if we merely supply gradients of the loss function at each training point. This is where the “Lambda” part of LambdaMART comes into play.

For further reading, Chapter 10 of Elements of Statistical Learning provides a great and thorough overview. This paper also provides a good intro of Gradient Boosting.


In ranking, the loss function that we’ll most likely care about optimizing is probably either NDCG, MAP, or MRR. Unfortunately, these loss functions aren’t differentiable at all points, so we can’t use them directly in our Gradient Boosting “framework”, since it’s unclear how we can provide the gradients at each training point.

To address this, LambdaMART uses an idea from a model called LambdaRank. For each training point pair i,j, we compute a value \lambda_{i,j} that acts as the gradient we need. Intuitively, \lambda_{i,j} can be thought of as a force that moves documents up and down the ranked list. For instance, if document i is ranked lower than document j, then document i will receive a push of size |\lambda_{i,j}| downwards in the ranked list, and j will be pushed upwards the same amount. The LambdaMART paper reports that empirically, the \lambda‘s have been used successfully to optimize NDCG, MAP, and MRR.

Hence in LambdaMART we use the Gradient Boosting framework with the \lambda_i‘s acting as the gradients g_i. This leads to update rules for the leaf values \gamma_l that depend on the \lambda values of the training instances that fall in leaf l.

Visualizing Gradient Boosting

We’ve observed that LambdaMART trains an ensemble of trees sequentially. What does this sequence of trees look like? Let’s look at an ensemble of 19 trees trained with LambdaMART on the OHSUMED dataset:


At each inner node, the small number denotes the feature label, and the larger number denotes the threshold. The number at each leaf node denotes the leaf output.

We can see that each tree is somewhat similar to the previous tree. This makes sense given that each tree t_i is dependent on the model state F_{i-1} since it is fit to the gradient with respect to that model’s gradient.

The step-by-step animation gives us insight into what is happening during training. It’s a great way to visualize and gain an intuition of the gradient boosting process – we see consecutive trees being produced that depend on the previous tree. However, we are still viewing the LambdaMART model as a group of separate trees; we still need to develop a way of viewing the final model cohesively.

Does the sequence animation tell us anything that could help with developing a single, cohesive model view? We can start with the observation that often, the trees look “roughly similar”.

Sometimes two trees will have the same feature at a given node position; for instance the root favors features 7, 9, and 11. Trees may not even change from step-to-step or may share a similar overall structure, such as trees #3 and #4 – notice how the numbers change, but the structure remains the same. Finally, all of the trees only draw from a subset of the total feature set; for instance feature 4 is never used by any of the trees.

Grouping Similar Nodes

Given these observations, we can think about how to build a unified visualization that takes advantage of reused features and similar structure between trees. Let’s consider the root node. We can count the frequency of each feature that occurs as the root node across the entire ensemble, leading to a single node that shows that feature 7 was used as the feature for the root node for 3 trees, feature 9 for 5 trees, and so on:


At each tree level h, there are 2^h possible node positions. We can collect the feature frequencies for each of these positions across the ensemble. If a position doesn’t appear in a tree, we mark it as ‘DNE’, and if a position is a leaf, we mark it as ‘Leaf’, such as position (3, 6):


Heatmap Tree

At each node, we have a list of feature (or DNE or Leaf) counts. Heatmaps provide a natural way of visualizing this count information. We can make each node of the ordered-pair-tree into a heatmap, giving rise to the “Heatmap Tree”:


The Heatmap Tree

To view the interactive d3.js visualization, see this link.

The Heatmap Tree shows the entire ensemble as a single unit. Each number is a feature, and the color scale tells us how many trees use that feature at the given node:

Heatmap Detail

By looking at the Heatmap Tree we can get a sense of which features the tree uses when it classifies instances. The ensemble uses features 7, 9, 11, 25, and 29 to perform the initial split, with 11 being used the most often. Further down the tree, we see features 7 and 9 again, along with common features such as 33 and 37. We can easily see that most of the trees in the ensemble are between 5 and 7 levels deep, and that while there are 46 total features, at a given node location the most variation we see is 9 features.

The Heatmap Tree can bring together hundreds of trees, such as this visualization of a 325 tree ensemble:


Tuning Parameters

The Heatmap Tree also lets us see how tuning parameters affect the final model. For instance, as we decrease the learning rate from 0.10 to 0.001, we see the ensemble size fluctuate:


Learning Rate 0.10


Learning Rate 0.01


Learning Rate 0.001

Notice how in the 0.01 case, the Heatmap Tree concisely summarized a 111-tree ensemble.

When we use feature subsampling, we see the number of features at each node increase (in general). Each tree has a different limited subset of features to choose from, leading to a spread in the overall distribution of chosen features:


Feature subsampling 0.6

Feature subsampling and training data subsampling makes this ensemble more crowded:


Feature subsampling 0.6, Data subsampling 0.6

Note that these parameter trends do not necessarily generalize. However, the Heatmap Tree captures all of the trees and features in a single structure, and gives us insight into the structural results of our parameter choices.


The Heatmap Tree, unfortunately, has its limits. With wide variation of features and many leaves, the tree becomes crowded:


Since the number of possible node locations at a given level increases exponentially with height, the tree also suffers when trying to visualize deep trees.


Another nice aspect of decision trees is that we can visualize how a test instance gets classified; we simply show the path it takes from root to leaf.

How could we visualize the classification process in an ensemble, via the Heatmap Tree or otherwise? With the Heatmap Tree, we would need to be able to simultaneously visualize 10’s or 100’s of paths, since there would be an individual path for every tree in the ensemble. One idea is to have weighted edges on the Heatmap Tree; an edge would become thicker each time the edge is used when classifying an instance.

Another next step is to test the generalizability of this visualization; could it work for any gradient boosting model? What would a Heatmap Tree of a random forest look like?


We’ve taken a close look at LambdaMART and gradient boosting. We’ve devised a way to capture the complexity of gradient boosted tree ensembles in a cohesive way. In doing so we bought back some of the interpretability that we lost by moving from a single tree to an ensemble, gained insight into the training process, and made the black box LambdaMART model a bit more transparent.

To see an interactive d3 Heatmap Tree, visit this link.

References and Further Reading

From RankNet to LambdaRank to LambdaMART: An Overview

Gradient Boosting Machines: A Tutorial

“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis

LDAOverflow with Online LDA

In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets, and visualized them. In this post, we’ll investigate using LDA on an 8gb dataset of around 8 million Stack Overflow posts.

We’ll need to take a different approach; for the tweets we used a batch algorithm that worked well for the relatively small dataset of around 5000 tweets, but would likely introduce performance issues when running on massive datasets. The batch algorithm also assumed that we have the entire training set at the start of training, making the approach unviable for streaming data, which we may receive during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training if we receive more data without have to retrain on the original chunk.

In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.

First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.

Variations on Variational Bayes

For brevity this part will assume that you’ve read through the math background in the first LDA post. I’ll also only highlight major parts; for the full story check out Hoffman’s online LDA paper.

In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.

In short, we approximate the true distribution by a simple distribution q, and associate parameters \phi,\ \gamma,\ \lambda with the original parameters z,\ \theta,\ \beta respectively. Recall that z gives the topic assignments for each word in each document , \theta gives the topic composition of each document, and \beta gives the word-topic probabilities for each word and each topic.

Specifically, we have:

q(z_{di}=k) = \phi_{dw_{di}k}

q(\theta_{d}) = dirichlet(\theta_{d}; \gamma_{d})

q(\beta_{k}) = dirichlet(\beta_{k}; \lambda_{k})

Our goal is to estimate \phi,\ \gamma,\ \lambda. In both batch and online LDA, we alternate between two steps:

1. E-Step: Estimate \phi,\ \gamma using the current value of \lambda

2. M-Step: Update \lambda , using the current value of \phi

The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.

Starting with Batch

In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:

for d = 1 to numDocs
    initialize \gamma
    repeat until change in \phi < \epsilon
        update \phi
        update \gamma

Then the M-Step updates \lambda using \phi values from every document:

update \lambda

The specific updates are:

\phi_{d,word_{i},k} \propto e^{E_{q}(log \theta_{d,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{d,k} = \alpha + \sum_{word_{i}}\phi_{d,word_{i},k}n_{d,word_{i}}

\lambda_{k,word_{i}}=\eta +\sum_{d}n_{d,word_{i}}\phi_{d,word_{i},k}

Where n_{d,word_{i}} is the number of occurrences of word_{i} in document d.

Going Online

In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.

The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for \phi and \gamma:

initialize \gamma
repeat until change in \phi < \epsilon
    update \phi
    update \gamma

In the M-Step, we first compute \lambda', which is the value of \lambda if we imagined that the entire dataset is made up of \frac{numDocs}{chunkSize} copies of the current chunk. Then \lambda is updated using a weighted sum of \lambda' and \lambda:

compute \lambda'
update \lambda

The specific updates are:

\phi_{iter,word_{i},k} \propto e^{E_{q}(log \theta_{iter,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{iter,k} = \alpha + \sum_{word_{i}}\phi_{iter,word_{i},k}n_{iter,word_{i}}

\lambda'_{k,word_{i}}=\eta +batchSize*n_{iter,word_{i}}\phi_{iter,word_{i},k}

\lambda = (1-\rho_t)\lambda+\rho_t\lambda'

Where n_{iter,word_{i}} is the number of occurrences of word_{i} in the current iteration’s chunk of documents, and \rho_t is a weighting parameter.

We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated \lambda, we can find the most probable words for each topic by looking at the word probabilities in each row of \lambda.

Intuitions of the Inference

If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are \gamma and \lambda. How do these get updated during training?

Updates of \gamma (the topic compositions for each document) are the prior \alpha plus a weighted sum of word counts. The word counts are weighted by \phi_{word,topic}, the probability of assigning the word to the topic. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.

Updates of \lambda_{topic,word} (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset is likely to be assigned to topic 2, then it makes sense that \lambda_{2, potato} should increase.

LDA Overflow

Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user generated content; I chose the dataset.

Read, Clean, Parse, Repeat

The data arrives as a 27gb XML behemoth. The first step is isolating the text from the Title and Body fields for each row. These fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.

Since the file is so large, we need to incrementally read the XML. We also filter out non alpha-numeric characters. Details for this process can be found in

Once runs, we get an 8gb text file containing around 8,000,000 stack overflow documents (title and body content). A couple examples:

Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table

Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary

LDA by Hoffman

We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application; to do so I created a wrapper for running LDA called Use

python -h

to see the various command line arguments.

I’ve also added more comments to on the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.

Building a Vocabulary

The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows us the algorithm to associate an index of \beta with a word ID and hence with a word.

We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words, and would probably not appear in the top words for a topic anyways since they appear so infrequently. Code for the vocabulary generation is found in


Let’s kick it off!

python dataset.txt vocabulary.txt

The training took ~12 hours for a 100 topic model on my MacBook. The values of \gamma and \lambda are output to files every 10,000 iterations and when the training completes. We can then use one of the \lambda files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:

python vocabulary.txt lambda-final.dat 2


topic 0

topic 1

The numbers are the word-topic probabilities from \lambda.

We’ll use the approach from the first LDA post to create word clouds for two different topics:



We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!

Credits & Links

Much of the material is derived from Hoffman’s paper and his online LDA implementation. Gensim and Vowpal Wabbit also have implementations of online LDA.

These Are Your Tweets on LDA (Part II)

In the last post, I gave an overview of Latent Dirichlet Allocation (LDA), and walked through an application of LDA on @BarackObama’s tweets. The final product was a set of word clouds, one per topic, that showed the weighted words that defined the topic.

In this post, we’ll develop a dynamic visualization that incorporates multiple topics, allowing us to gain of a high level view of the topics and also drill down to see the words that define each topic. Through a simple web interface, we’ll also be able to view data from different twitter users.

Click here for an example of the finished product.

As before, all of the code is available on GitHub. The visualization-related code is found in the viz/static directory.

Harnessing the Data

In the last post, we downloaded tweets for a user and found 50 topics that occur in the user’s tweets along with the top 20 words for each topic. We also found the composition of topics across all of the tweets, allowing us to rank the topics by prominence. For our visualization, we’ll choose to display the 10 highest ranked topics for a given twitter user name.

We need a visualization that can show multiple groupings of data. Each of the 10 groupings has 20 words, so we’d also like one that avoids the potential information overload. Finally, we’d like to incorporate the frequencies that we have for each word.


A good fit for these requirements is d3.js‘s Zoomable Pack Layout, which gives us a high level view of each grouping as a bubble. Upon clicking a bubble, we can see the data that comprises the bubble, as well as each data point’s relative weight:


d3 to the rescue

d3 Zoomable Pack Layout

In our case, each top-level bubble is a topic, and each inner bubble is a word, with its relative size determined by the word’s frequency.

Since the d3 visualization takes JSON as input, in order to plug in our LDA output data we simply create a toJSON() method in that outputs the data associated with the top 10 topics to a JSON file. The ‘name’ of each topic is simply the most probably word in the topic.

Now, when the LDA process (the main() method in is run for a twitter user, the code will create a corresponding JSON file in viz/json. The JSON structure:

     "name": {topic_1_name},
        "name": {topic_1_word_1},
        "size": {topic_1_word_1_freq}
        "name": {topic_1_word_2},
        "size": {topic_1_word_2_freq}
        "name": {topic_1_word_3},
        "size": {topic_1_word_3_freq}


Now, we make slight modifications to the javascript code embedded in the given d3 visualization. Our goal is to be able to toggle between results for different twitter users; we’d like to switch from investigating the @nytimes topics to getting a sense of what @KingJames tweets about.

To do so, we add a drop-down to index.html, such that each time a user is selected on the drop-down, their corresponding JSON is loaded by the show() function in viz.js. Hence we also change the show() function to reload the visualization each time it is called.

Making The Visualizations Visible

To run the code locally, navigate to the viz/static directory and start an HTTP server to serve the content, e.g.

cd {project_root}/viz/static
python -m SimpleHTTPServer

then navigate to http://localhost:8000/index.html to see the visualization.

By selecting nytimes, we see the following visualization which gives a sense of the topics:

@nytimes topics

Upon clicking the ‘gaza’ topic, we see the top words that comprise the topic:

'gaza' topic

I’ve also used Heroku to put an example of the finished visualization with data from 10 different twitter usernames here:

Have fun exploring the various topics!

These Are Your Tweets on LDA (Part I)

How can we get a sense of what someone tweets about? One way would be to identify themes, or topics, that tend to occur in a user’s tweets. Perhaps we can look through the user’s profile, continually scrolling down and getting a feel for the different topics that they tweet about.

But what if we could use machine learning to discover topics automatically, to measure how much each topic occurs, and even tell us the words that make up the topic?

In this post, we’ll do just that. We’ll retrieve users’ tweets, and use an unsupervised machine learning technique called Latent Dirichlet Allocation (LDA) to uncover topics within the tweets. Then we’ll create visualizations for the topics based on the words that define them. Our tools will be Java, Twitter4J, and Mallet. All of the code is available on GitHub for reference.

As a sneak preview, here’s a visualization of a topic from @gvanrossum:

Screen Shot 2014-08-30 at 10.04.54 PM

First I’ll give an intuitive background of LDA, then explain some of the underlying math, and finally move to the code and applications.

What’s LDA?

Intuitively, Latent Dirichlet Allocation provides a thematic summary of a set of documents (in our case, a set of tweets). It gives this summary by discovering ‘topics’, and telling us the proportion of each topic found in a document.

To do so, LDA attempts to model how a document was ‘generated’ by assuming that a document is a mixture of different topics, and assuming that each word is ‘generated’ by one of the topics.

As a simple example, consider the following tweets:

(1) Fruits and vegetables are healthy.

(2) I like apples, oranges, and avocados. I don’t like the flu or colds.

Let’s remove stop words, giving:

(1) fruits vegetables healthy

(2) apples oranges avocados flu colds

We’ll let k denote the number of topics that we think these tweets are generated from. Let’s say there are k = 2 topics. Note that there are V = 8 words in our corpus. LDA would tell us that:

Topic 1 = Fruits, Vegetables, Apples, Oranges, Avocados
Topic 2 = Healthy, Flu, Colds

And that:

Tweet 1 = (2/3) Topic 1, (1/3) Topic 2
Tweet 2 = (3/5) Topic 1, (2/5) Topic 2

We can conclude that there’s a food topic and a health topic, see words that define those topics, and view the topic composition of each tweet.

Each topic in LDA is a probability distribution over the words. In our case, LDA would give k = 2 distributions of size V = 8. Each item of the distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions \beta_{1}. It might look something like:

\beta_{1} = [0.4, 0.2, 0.15, 0.05, 0.05, 0.05, 0.05]

\beta_{1} lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?

Now, I’ll jump into the math underlying LDA to explore specifically what LDA does and how it works. If you still need some more intuition-building, see Edwin Chen’s great blog post. Or feel free to skip to the application if you’d like, but I’d encourage you to read on!

A Bit More Formal

LDA assumes that documents (assumed to be bags of words) are generated by a mixture of topics (distributions over words). We define the following variables and notation:

k is the number of topics.

V is the number of unique words in the vocabulary.

\theta  is the topic distribution (of length k ) for a document, drawn from a uniform Dirichlet distribution with parameter \alpha .

z_{n} is a topic 'assignment' for word w_{n}, sampled from p(z_{n} = i|\theta) = \theta_{i}.

\textbf{w} = (w_{1}, ... , w_{N}) is a document with N words.

w_{n}^{i} = 1 means that the word w_{n} is the i'th word of the vocabulary.

\beta  is a k \times V matrix, where each row \beta_{i}  is the multinomial distribution for the ith topic. That is, \beta_{ij} = p(w^{j} = 1 | z_{j} = i).

LDA then posits that a document is generated according to the following process:

1. Fix k and N.

2. Sample a topic distribution \theta from Dir(\alpha_{1}, ... , \alpha_{k}) .

\theta defines the topic mixture of the document, so intuitively \theta_{i} is the degree to which topic_{i} appears in the document.

3. For each word index n \in \left\{1,..., N\right\}:

4. Draw z_{n} from \theta.

z_{n} = i tells us that the word we are about to generate will be generated by topic i.

5. Draw a word w_{n} from \beta_{z_{n}}.

In other words, we choose the row of \beta based on our value of z_{n} from (4), then sample from the distribution that this row defines. Going back to our example, if we drew the “Food” row in step (4), then it’s more likely that we’ll generate “Fruits” than “Flu” in step (5).

We can see that this does in fact generate a document based on the topic mixture \theta , the topic-word assignments z, and the probability matrix \beta .

However, we observe the document, and must infer the latent topic mixture and topic-word assignments. Hence LDA aims to infer:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) .

Coupling Problems

The story’s not over quite yet, though. We have:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) = \frac{p(\theta, \textbf{z}, \textbf{w} | \alpha, \beta)}{p(\textbf{w} | \alpha, \beta)}

Let’s consider the denominator. I’m going to skip the derivation here (see pg. 5 of Reed for the full story), but we have:

p(\textbf{w} | \alpha, \beta) = \frac{\Gamma(\sum_{i=1}^{k}\alpha_{i})}{\prod_{i=1}^{k}\Gamma(\alpha_{i})} \int (\prod_{i=1}^{k}\theta_{i}^{\alpha_{i}-1})(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_{i}\beta_{ij})^{w_{n}^{j}})\,d\theta

We cannot separate the \theta and \beta, so computing this term is intractable; we must find another approach to infer the hidden variables.

A Simpler Problem

A workaround is to find a convex distribution that lower-bounds the distribution that we want to estimate. Then we can find an optimal lower bound to estimate the distribution that is intractable to compute. We simplify the problem to:

q(\theta, \textbf{z}|\gamma, \phi) = q(\theta|\gamma)\prod_{n=1}^{N}q(z_{n}|\phi_{n})

And minimize the KL-Divergence between this distribution and the actual distribution p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) , resulting in the problem:

(\gamma^{*}, \phi^{*}) = argmin_{\gamma, \phi} D_{KL}(q(\theta, z|\gamma, \phi)||p(\theta, z|w, \alpha, \beta))

Since we also do not know \beta and \alpha, we use Expectation Maximization (EM) to alternate between estimating \beta and \alpha using our current estimates of \gamma and \phi, and estimating \gamma and \phi using our current estimates of \beta and \alpha.

More specifically, in the E-step, we solve for (\gamma^{*}, \phi^{*}), and in the M-step, we perform the updates:

\beta_{ij} \propto\sum_{d=1}^{M}\sum_{n=1}^{N_{d}}\phi^{*}_{dni}w^{j}_{dn}


log(\alpha^{t+1}) = log(\alpha^{t}) - \frac{\frac{dL}{d\alpha}}{\frac{d^{2}L}{d\alpha^{2}\alpha}+\frac{dL}{d\alpha}}

I’ve glossed over this part a bit, but the takeaway is that we must compute a lower bound of the actual distribution, and we use EM to do so since we have two sets of unknown parameters. And in the end, we end up with estimates of \theta, \textbf{z}, \beta as desired.

For more in depth coverage, see Reed’s LDA Tutorial or the original LDA paper.

Now, let’s move to the code.

The Plan

We’ll use @BarackObama as the running example. First we’ll download @BarackObama’s tweets, which will be our corpus, with each tweet representing a ‘document’.

Then, we’ll run LDA on the corpus in order to discover 50 topics and the top 20 words associated with each topic. Next, we will infer the topic distribution over the entire set of tweets. Hence we’ll be able to see topics and the degree to which they appear in @BarackObama’s tweets.

Finally, we’ll visualize the top 20 words for a given topic based on the relative frequency of the words.

I’ll walk through the important parts of the code, but I’ll skip the details in the interest of brevity. For the whole picture, check out the code on GitHub; running the main() method of will run the entire process and produce similar results to those shown below.

Getting the Tweets

To retrieve the tweets, we’ll rely on the Twitter4J library, an unofficial Java library for the Java API. The code found in is a wrapper that helps out with the things we need. The method that does the ‘work’ is

public List<String> getUserTweetsText(String username, int n)

which retrieves the last n of a user’s tweets and returns them as a List of Strings. In this method we access 1 page (200 tweets) at a time, so the main loop has n/200 iterations.

The highest-level method is

public void downloadTweetsFromUser(String username, int numTweets)

which calls getUserTweetsText() and saves the output to files. For organization’s sake, it saves the user’s tweets to

  • ./data/{username}/{username}_tweets.txt, which contains one tweet per line
  • ./data/{username}/{username}_tweets_single.txt, which contains all of the tweets on a single line. This second file will be used later to infer the topic distribution over the user’s entire set of tweets.

Hence we can download 3000 of @BarackObama’s tweets like so:

TwitterClient tc = new TwitterClient();
tc.downloadTweetsFromUser("BarackObama", 3000);

Hammering Out Some Topics

Now it’s time to take a Mallet to the tweets in order to mine some topics. Mallet is a powerful library for text-based machine learning; we can use its topic modeling through its Java API to load and clean the tweet data, train an LDA model, and output results. In order to use the Mallet API, you’ll have to follow the download instructions and build a jar file, or get it from this post’s GitHub repo.

The Mallet-related code that I’ll discuss next is found in

Loading the Data

The first step is to prepare and load the data. Our goal is to get the data in the {username}_tweets.txt file into an InstanceList object, i.e. a form that can be used by Mallet models.

To do so, we first create a series of “Pipes” to feed the data through. The idea is that each Pipe performs some transformation of the data and feeds it to the next Pipe. In our case, in

static ArrayList<Pipe> makePipeList()

we create a series of Pipes that will lowercase and tokenize the tweets, remove stop words, and convert the tweets to a sequence of features.

Then, in

static InstanceList fileToInstanceList(String filename)

we iterate through the input file, and use our Pipe list to modify and prepare the data, returning the InstanceList that we set out to build.

 Training Time

It’s training time. Using

public static ParallelTopicModel trainModel(InstanceList instances,
                                      int numTopics, int numIters,
                                      double alphaT, double betaW)

we train an LDA model called ParallelTopicModel on the InstanceList data. The betaW parameter is a uniform prior for \beta , and the alphaT parameter is the sum of the \alpha parameter; recall from the math section that \beta is the numtopics \times vocabsize matrix that gives word probabilities given a topic, and \alpha is the parameter for the distribution over topics.

Looking at the Output, Part I

With a trained model, we can now look at the words that make up the topics using \beta , and the composition \theta_i for tweet_i .

Printed to stdout, we see the 50 topics, with the top 20 words for each topic. The number attached to each word is the number of occurrences:

Topic 0 
families:13 republican:11 read:7 fast:5 ed:5 op:5 family:5 marine:4 democrat:4 efforts:4 policies:4 story:4 pendleton:3 camp:3 explains:3 military:3 #cir:3 california:3 #familiessucceed:3 workplace:3
Topic 1 
president:175 obama:165 middle:61 class:59 jobs:47 #abetterbargain:37 economy:37 good:32 growing:20 #opportunityforall:20 isn:19 #rebuildamerica:18 today:18 create:17 watch:17 infrastructure:16 americans:16 plan:15 live:15 american:15

Topic 48 
president:72 address:62 obama:57 watch:56 weekly:49 opportunity:15 economic:15 discusses:14 issue:11 importance:10 working:10 week:10 speak:9 budget:8 discuss:8 congress:7 calls:6 building:6 #opportunityforall:6 lady:5
Topic 49 
discrimination:19 lgbt:17 rights:15 #enda:15 americans:14 law:10 act:10 today:10 thedreamisnow:7 screening:7 voting:7 protections:7 basic:7 stand:7 add:7 american:7 anniversary:6 workplace:6 workers:6 support:6

Each box corresponds to taking a row of \beta , finding the indices with the 20 highest probabilities, and choosing the words that correspond to these indices.

Since each tweet is a document, the model also contains the topic distribution for each tweet. However, our goal is to get a sense of the overall topic distribution for all of the user’s tweets, which will require an additional step. In other words, we’d like to see a summary of the major topics that the user tends to tweet about.

Getting An Overall Picture

To do so, we will use our trained model to infer the distribution over all of the user’s tweets. We create a single Instance containing all of the tweets using

singleLineFile = {username}_tweets_single.txt

and find the topic distribution:

double[] dist = inferTopicDistribution(model, 

We can then look at the distribution in  ./data/{username}/{username}_composition.txt:

0 0.006840685214902167
1 0.048881207831654686
29 0.09993216645340489
48 0.022192334924649955
49 0.01473112438501846

We see, for instance, that topic 29 is more prominent than topic 0; specifically, the model inferred that more words were generated by topic 29 than topic 0.

In ./data/{username}/{username}_ranked.txt we have the top 10 topics, ranked by composition, along with each topic’s top words. For instance, at the top of the file is:

Topic 29:

This topic could probably be labeled as “presidential”; a topic we’d expect to find near the top for @BarackObama.

Looking on, we see a topic that is clearly about healthcare:

Topic 24

and one about climate change:

Topic 28

The inferred topics are pretty amazing; a job well done by LDA. But while viewing words and their frequencies may be fun, let’s visualize a topic in a nicer way.

Into the Clouds

We now have the words that make up the most prominent topics, along with the frequency of each word. A natural visualization for a topic is a word cloud, which allows us to easily see the words and their relative weights.

It turns out that a word-cloud generator named Wordle can create word clouds given a list of weighted words…exactly the format found in ./data/{username}/{username}_ranked.txt !

Let’s copy the word:frequency list for Topic 29 and throw it into Wordle (dialing down obama and president to 200):

and the results:

Topic 29 – “Presidential”

Topic 24 – “Healthcare”

Topic 28 – “Climate Change”

A Brief Pause

Let’s pause for a second. This is a nice moment and a beautiful result.

Take a look at the words : LDA inferred a relationship between “country”, “america”, and “obama”. It grouped “insurance”, “#obamacare”, and “health”. It discovered a link between “climate”, “deniers”, and “#sciencesaysso”.

Glance back up at the math section. We never told it about presidents, countries, or healthcare. Nowhere in there is there a hard-coded link between climate words and science hash tags. In fact, it didn’t even know it would be dealing with words or tweets.

It’s ‘just’ an optimization problem, but when applied it can discover complex relationships that have real meaning. This is an aspect that I personally find fascinating about LDA and, more generally, about machine learning.

As a next step, feel free to download the code and try out other usernames to get a sense of LDA’s generality; its not just limited to @BarackObama, climate, or healthcare.

Next Steps: From One to Many

Currently, we have a nice way of viewing the content of one topic, in isolation. In the next post, we’ll develop a visualization using d3.js for all of the top topics at once. We’ll be able to see and compare topics side by side, and obtain a higher level view of the overall topic distribution. As a sneak preview, here’s a visualization of the top 10 topics from @nytimes:

@nytimes top 10

Credits and Links

Much of the mathematical content is derived from Reed’s Tutorial and the LDA Paper (or the shorter version). These are also great resources for learning more. Edwin Chen’s blog post also provides a good introduction and an intuition-building example. The Mallet developer’s guide and data importing guide provide good examples of using the Mallet API. The Programming Historian has a great intro to using Mallet for LDA from the command line.

Eigenfaces and Forms

Plato’s Theory of Forms contains the idea that all objects have an abstract ‘form’; an underlying ‘essence’ shared by all objects of a certain type. By Plato’s theory, all horses have an innate ‘horseness’ despite differences in outward appearance, and all apples have a distinct ‘appleness’ that lets us group a Granny Smith and a Red Delicious into a single ‘apple’ category. Extending this idea to human faces, all faces would have a shared, abstract structure called ‘faceness’.

There are arguments against Plato’s theory, but it still invites an interesting analogy with an idea in machine learning called eigenfaces. Could we use machine learning to uncover abstract, common parts underlying a human face?

Let’s start with face images. Is there an abstract form or a set of common, abstract ingredients shared by all faces? If there are, how could we use these common ingredients to construct two completely different faces?

This post will look into these questions, and although it may not reach definitive conclusions about Platonic forms, it does contain some pretty cool gifs at the end. We’ll use a machine learning technique called principal component analysis (PCA) and apply it to a dataset of face images.

By using PCA, we can represent data in lower dimensions by expressing it as a combination of its most ‘important’ directions. In the space of face images, these ‘important’ directions are called eigenfaces. Intuitively, each of the eigenfaces explains some of the variance between face images. Once we compute the eigenfaces, we can use them to compactly represent and reconstruct the original face images.

To perform the investigation, our tools will be Python, numpy, and scipy. All of the code, as well as some sample results, can be found on GitHub.

Representing the Images

The first step is preparing the images. First, we choose one image of each person from the faces94 dataset, and convert it to greyscale. By converting to greyscale, each pixel in the image can be represented as a single number. We represent the entire image set as an n x p matrix X, with n images and p pixels; each row of the matrix corresponds to an image’s pixels.

X = numpy.array([imread(i, True).flatten() for i in jpgs])

Performing PCA

Now, with the images loaded comes the core of the work: performing the principal component analysis. First, we find the mean face image and subtract it from each face in the dataset. This centers the data at the origin.

# compute the mean face
mu = numpy.mean(X, 0)
# mean adjust the data
ma_data = X - mu

The Mean Face


Now, we decompose the mean-centered matrix into three parts using singular value decomposition (SVD). SVD decomposes the face matrix into three parts, UΣV, where U and V are the left and right singular vectors of X, respectively, and Σ is a diagonal matrix whose elements are the singular values of X.

U, S, V = linalg.svd(ma_data.transpose(), full_matrices=False)

Now, the columns of U form an orthonormal basis for the eigenspace of X‘s covariance matrix. What significance does this have for our investigation? Each column of U is a singular vector, corresponding to an eigenface! The corresponding singular values are found in Σ, and are decreasing from left to right. Visually, here are the first three columns of U, which are the first three eigenfaces:


Eigenface 1


Eigenface 2


Eigenface 3

How Many Eigenfaces?

Our dataset consists of only n = 117 training images, and each image has p = 180*200 = 36,000 pixels. Since n < p we observe that SVD will return only n eigenfaces with non-zero singular values; therefore we have = 117 different eigenfaces.

Hauntingly Important Faces

The eigenfaces are abstract – and scary – faces. Intuitively, we can think of each eigenface as an ‘ingredient’ common to all faces in the dataset. A person’s face is constructed using some combination of the eigenface ingredients. In fact, all of the faces in X contain some amount of each eigenface- perhaps these ingredients are the abstract, common structure we were looking for!

Some eigenfaces are more ‘important’ to reconstructing faces than others. The columns of U are ordered in terms of decreasing importance, since their corresponding singular values in Σ decrease from left to right. The first column, corresponding to the first eigenface, is the most ‘important’; it explains the most variance in the space of faces. The second column, corresponding to the second eigenface is the second most ‘important’, the third column, corresponding to the the third eigenface is the third most ‘important’, and so on.

The first eigenface:


Eigenface 1

is more important than the 103rd eigenface:


Eigenface 103

We can reconstruct a face by using these eigenface ingredients; each face is just a weighted combination of the ingredients. We obtain the weights by dotting the mean centered data with the eigenfaces:

# weights is an n x n matrix
weights =, e_faces)

For each image, we have a weight for each of the n eigenfaces, so weights is an n x n matrix. To reconstruct a face, we dot the face’s weights (a row in the weights matrix) with the transpose of the eigenfaces (n x p matrix), and add the mean face back in:

# reconstruct the face located at img_idx
recon = mu +[img_idx, :], e_faces.T)



Original Image

We’ve reconstructed a face by dotting the face’s weights with each eigenface. But how is this useful? Let’s consider what we used to reconstruct the image:

  • a p x n matrix of eigenfaces (e_faces)
  • n x 1 vector of weights (one row from weights)

The eigenfaces matrix stays constant for all reconstructions; to reconstruct a new face we simply supply a weight vector. Therefore a face is uniquely defined by a n x 1 vector. Keep in mind that n = 117 and p = 36000; instead of using all 36000 pixels to distinguish one image from another, we now only use 117 weights!

But it gets better; in fact we’ve yet to fully take advantage of PCA’s dimensionality reduction. So far, we have reconstructed the face using all n weights and n eigenfaces. However, we observed that the initial eigenfaces explain more variance between faces than the later eigenfaces.

Leaner Reconstruction

We can take advantage of this observation, and remove the ‘unimportant’ eigenfaces; instead of using all n, we can instead use just the first k weights and eigenfaces, where k < n. By doing so we’re representing the face image using fewer dimensions, and we can reconstruct an approximation of the face using only this lower-dimensional data. To illustrate, let’s choose k = 50 and reconstruct the image:

# reconstruct the image at img_idx using only 50 eigenfaces
k = 50
recon = mu +[img_idx, 0:k], e_faces[:, 0:k].T)

Reconstruction with 50 eigenfaces

This time, to reconstruct the image all we needed was (keeping in mind that k < p):

  • p x k matrix of eigenfaces (k columns of e_faces)
  • a k x 1 vector of weights (columns of one row from weights)

However, the picture indicates a tradeoff: the fewer eigenfaces we use, the rougher the reconstruction, and the more eigenfaces we use, the closer we get to the original picture. In other words, when we reduce k, we give up some reconstruction accuracy. But this is where eigenfaces get fascinating. Since each eigenface is less and less important to the reconstruction, we reach something resembling the original picture fairly quickly; in our case it’s never necessary to use all 117 eigenfaces! To illustrate, let’s consider the following image.

The original image is:


The image reconstructed using only 1 eigenface is:


k = 1
recon = mu +[img_idx, 0:k], e_faces[:, 0:k].T)

It’s still very abstract.

Using 20 eigenfaces, we have:


k = 20
recon = mu +[img_idx, 0:k], e_faces[:, 0:k].T)

The picture is becoming more accurate; we can begin to make out his glasses, hair, and face shape.

At just 40 eigenfaces, we already have a picture that is clearly the same individual that we started with:


And at 50:


Reconstructing another person’s face, we have:


1 Eigenface


20 Eigenfaces


40 Eigenfaces


50 Eigenfaces


80 Eigenfaces

It’s fascinating that the eigenfaces stay the same from person to person; we simply change the weightings and have a new identity. Again, if we just store one copy of the eigenfaces matrix, we can reconstruct a person’s picture using just k numbers. We can also compare two faces using just dimensions instead of p.

To further illustrate the idea, here are some animations. The animations start with the mean face, then progressively reconstruct a face, using an increasing number of eigenfaces. Specifically, each frame uses one eigenface more than the previous frame. At first we see massive changes from frame to frame, since the early eigenfaces are the most informative. Towards the end we see that using an additional eigenface only makes a minor change; more eigenfaces are being used, but the face looks roughly the same.


Back to the Forms

We began with a vague search for an abstract, common ‘form’ for face images. By computing eigenfaces, we created a set of shared ‘ingredients’ that define face images. Starting from an abstract image (the mean face), we are able to add combinations of the ingredients to reconstruct completely different faces.

Perhaps we could interpret the eigenfaces as defining the underlying, abstract ‘faceness’ in each face image. The same procedure could be applied to pictures of apples to visualize ‘appleness’, or pictures of horses for ‘horseness’.

Regardless, eigenfaces are a great intuition-builder and a fascinating way to visualize PCA. Eigenfaces have applications in facial recognition, and the more general PCA has a vast range of applications throughout machine learning.


This post was inspired by Jeremy Kun’s eigenface post on his great blog, Math ∩ Programming, Penn Machine Learning lecture notes, and a presentation about eigenfaces. The dataset used was discussed in Kun’s post, and is found here.

AI’s Future with Andrew Ng

I ran across this short video of Andrew Ng from Stanford giving a broad, high level overview of developments in Artificial Intelligence. It’s fascinating to hear how neural networks led to a shift in the way he views AI’s future. At the end, he also briefly mentions his motivations for pursuing general AI: eliminating mundane tasks and freeing up time for ‘higher endeavors’.