ECIR 2016 Paper and Presentation

I recently presented a paper entitled Efficient AUC Optimization for Information Ranking Applications at the European Conference on Information Retrieval (ECIR) in Padua, Italy.

In the paper, we derive gradient approximations for optimizing area under an ROC curve (AUC) and multi-class AUC using the LambdaMART algorithm. The slides from the talk give an overview of the paper and a brief review of Learning to Rank and LambdaMART.

The goal was to expand LambdaMART to optimize AUC in order to add to the portfolio of metrics that the algorithm is currently used for. The approach was to derive a “λ-gradient” analogous to those that have been defined for other metrics such as NDCG and Mean Average Precision.

This in turn required an efficient way of computing a key quantity: the change in AUC from swapping two items in the ranked list. One contribution of the paper is a simple, efficient formula (with an accompanying proof) for computing this quantity.


The paper also contains experimental results measuring the performance of LambdaMART using the derived gradients.

Peering into the Black Box: Visualizing LambdaMART

In the last post, I gave a broad overview of the Learning to Rank domain of machine learning that has applications in web search, machine translation, and question-answering systems. In this post, we’ll look at a state of the art model used in Learning to Rank called LambdaMART. We’ll take a look at some math underlying LambdaMART, then focus on developing ways to visualize the model.

LambdaMART produces a tree ensemble model, a class of models traditionally viewed as ‘black boxes’ since their predictions combine the outputs of tens or hundreds of underlying trees. While a single decision tree is intuitive and interpretable, viewing 100 at once would be overwhelming and uninformative:



Single Tree


Ensemble of Trees


By moving from a single tree to an ensemble of trees, we trade off interpretability for performance.

In this post, we’ll take a look at the trees produced by the LambdaMART training process, and develop a visualization to view the ensemble of trees as a single collective unit. By doing so, we’ll gain back some of the intuitive interpretability that makes tree models appealing. Through the visualization, we’ll peer into the black box model and get a sense of the factors that the model uses to make predictions.

We use Java and the JForests library to train LambdaMART and parse its output, and d3.js for visualization. To get a sneak peek, the final visualization is here.

LambdaMART Overview

Let’s take a look at some details of LambdaMART. For the full story, check out this paper from Microsoft Research.

At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG. To understand LambdaMART we’ll look at two aspects: Lambda and MART.


LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART). Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.

Our goal is to find a function f(x) that minimizes the expected loss L:

\hat{f}(x) = \arg\min_{f(x)}E[L(y, f(x))|x]

for feature vectors x and labels y.

To do so we first view \hat{f}(x) as a sum of weak learners f_m:

\hat{f}(x)=\sum_{m=1}^M f_{m}(x)

Since we are dealing with decision trees, evaluating f_{m}(x) corresponds to passing x down the tree f_{m} until it reaches a leaf node l. The predicted value is then equal to a parameter \gamma_l.

Decision trees consist of two types of parameters: regions R_l that assign a training example x_i to a leaf node l, and leaf outputs \gamma_l that represent the tree’s output for all examples assigned to region R_l. Hence we can write each weak learner f_{m}(x) as a parameterized tree h_{m}(x;R_{m}, \gamma_{m}), giving:

\hat{f}(x)=\sum_{m=1}^M f_{m}(x)=\sum_{m=1}^M h_{m}(x; R_{m}, \gamma_{m})

Gradient Boosting finds each tree in a stepwise fashion. We start with an initial tree f_1, then step to another tree f_2, and so on:

\hat{f}_{1}(x) = f_{1}

\hat{f}_{2}(x) = f_{1} + f_{2}

\hat{f}_{M}(x) = \sum_{m=1}^{M}f_{m}

How do we determine the steps? At each step, we’d like the model to change such that the loss decreases as much as possible. The direction of locally optimal decrease is given by the gradient of the loss with respect to the current model’s prediction at each training point:

g_{im}=\frac{\partial L[f(x_i), y_i]}{\partial f(x_i)}

Hence we define the step as f_m=-\rho_mg_m. This gives us insight into how Gradient Boosting solves the minimization – the algorithm is performing Gradient Descent in function space.

Now, we know that we want to take a gradient step, but still haven’t said how that translates to finding a tree. To do so, we build a tree that models the gradient; by adding such a tree to the ensemble, we effectively take a gradient step from the previous model. To fit the tree, we use squared error loss, giving the minimization problem:

\arg\min_{R, \gamma}\sum_{i=1}^N(-g_{im}-F(x_i;R, \gamma))^2

Where F(x_i;R, \gamma) denotes a regression tree with parameters R, \gamma.
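To make the boosting loop concrete, here is a minimal Python sketch, assuming scikit-learn’s DecisionTreeRegressor as the weak learner and squared-error loss for illustration; the function names and the fixed shrinkage factor (standing in for the step size \rho_m) are mine, not code from the post.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, loss_gradient, n_trees=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting sketch: each tree is fit to the negative
    gradient of the loss evaluated at the current model's predictions."""
    trees = []
    F = np.zeros(len(y))                      # current predictions f_hat(x_i)
    for m in range(n_trees):
        g = loss_gradient(F, y)               # g_im = dL/df(x_i) at the current model
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g)                       # regions R_m and leaf values gamma_m fit to -g
        F += learning_rate * tree.predict(X)  # take a (shrunken) gradient step
        trees.append(tree)
    return trees

# With squared-error loss L = (y - f)^2 / 2, the gradient is simply F - y:
squared_error_gradient = lambda F, y: F - y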

Let’s take a step back. We’ve just set up a framework for finding an ensemble of trees that, when added together, minimizes a loss function. It’s a “framework” in the sense that we can use it if we merely supply gradients of the loss function at each training point. This is where the “Lambda” part of LambdaMART comes into play.

For further reading, Chapter 10 of Elements of Statistical Learning provides a great, thorough overview. This paper also provides a good introduction to Gradient Boosting.


In ranking, the loss functions we care about optimizing are metrics such as NDCG, MAP, or MRR. Unfortunately, these functions aren’t differentiable at all points, so we can’t use them directly in our Gradient Boosting “framework”: it isn’t clear how to provide the gradients at each training point.

To address this, LambdaMART uses an idea from a model called LambdaRank. For each pair of training points i, j, we compute a value \lambda_{i,j} that acts as the gradient we need. Intuitively, \lambda_{i,j} can be thought of as a force that moves documents up and down the ranked list: if document i is more relevant than document j but is ranked below it, then i receives an upward push of size |\lambda_{i,j}| in the ranked list, and j is pushed downwards by the same amount. The LambdaMART paper reports that, empirically, the \lambda’s have been used successfully to optimize NDCG, MAP, and MRR.

Hence in LambdaMART we use the Gradient Boosting framework with the \lambda_i’s acting as the gradients g_i. This leads to update rules for the leaf values \gamma_l that depend on the \lambda values of the training instances that fall in leaf l.
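As a rough illustration (not code from the post), here is a sketch of how per-document \lambda values could be accumulated for one query, following the pairwise form in Burges’s overview paper, up to sign conventions; delta_metric is a placeholder for the absolute change in NDCG, MAP, or another metric obtained by swapping two documents.

import numpy as np

def lambdas_for_query(scores, relevance, delta_metric, sigma=1.0):
    """Accumulate a lambda 'gradient' for each document of a single query.
    scores[i]    : current model score for document i
    relevance[i] : annotated relevance label for document i
    delta_metric : delta_metric(i, j) -> |metric change| from swapping i and j"""
    n = len(scores)
    lam = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if relevance[i] <= relevance[j]:
                continue                        # only pairs where i should outrank j
            # RankNet-style pairwise term scaled by the metric change of the swap
            lam_ij = sigma / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
            lam_ij *= delta_metric(i, j)
            lam[i] += lam_ij                    # push the more relevant document up
            lam[j] -= lam_ij                    # push the less relevant document down
    return lam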

Visualizing Gradient Boosting

We’ve observed that LambdaMART trains an ensemble of trees sequentially. What does this sequence of trees look like? Let’s look at an ensemble of 19 trees trained with LambdaMART on the OHSUMED dataset:


At each inner node, the small number denotes the feature label, and the larger number denotes the threshold. The number at each leaf node denotes the leaf output.

We can see that each tree is somewhat similar to the previous tree. This makes sense: each tree t_i depends on the state of the model F_{i-1}, since it is fit to the gradient of the loss with respect to that model’s predictions.

The step-by-step animation gives us insight into what is happening during training. It’s a great way to visualize and gain an intuition of the gradient boosting process – we see consecutive trees being produced that depend on the previous tree. However, we are still viewing the LambdaMART model as a group of separate trees; we still need to develop a way of viewing the final model cohesively.

Does the sequence animation tell us anything that could help with developing a single, cohesive model view? We can start with the observation that often, the trees look “roughly similar”.

Sometimes two trees will have the same feature at a given node position; for instance, the root favors features 7, 9, and 11. Trees may not change at all from step to step, or may share a similar overall structure, such as trees #3 and #4 – notice how the numbers change, but the structure remains the same. Finally, all of the trees draw from only a subset of the total feature set; for instance, feature 4 is never used by any of the trees.

Grouping Similar Nodes

Given these observations, we can think about how to build a unified visualization that takes advantage of reused features and similar structure between trees. Let’s consider the root node. We can count the frequency of each feature that occurs as the root node across the entire ensemble, leading to a single node that shows that feature 7 was used as the feature for the root node for 3 trees, feature 9 for 5 trees, and so on:


At each tree level h, there are 2^h possible node positions. We can collect the feature frequencies for each of these positions across the ensemble. If a position doesn’t appear in a tree, we mark it as ‘DNE’, and if a position is a leaf, we mark it as ‘Leaf’, such as position (3, 6):
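As a sketch of how these per-position counts could be collected (assuming each parsed tree is represented as a nested dict, which is my stand-in for whatever structure the JForests output parser produces):

from collections import defaultdict

def feature_counts_by_position(trees):
    """Count, for every (level, position) pair, how often each feature
    (or 'Leaf', or 'DNE') appears there across the ensemble.
    Assumed node format: {'feature': 7, 'children': [left, right]} for inner
    nodes and {'leaf': True} for leaves; a missing child is None."""
    counts = defaultdict(lambda: defaultdict(int))   # (level, pos) -> {label: count}

    def visit(node, level, pos):
        if node is None:
            counts[(level, pos)]['DNE'] += 1         # position absent in this tree
            return
        if node.get('leaf'):
            counts[(level, pos)]['Leaf'] += 1
            return
        counts[(level, pos)][node['feature']] += 1
        children = node.get('children', [None, None])
        visit(children[0], level + 1, 2 * pos)       # left child
        visit(children[1], level + 1, 2 * pos + 1)   # right child

    for tree in trees:
        visit(tree, 0, 0)
    return counts

This is a simplification: positions below a leaf in one tree are not marked ‘DNE’ even if they exist in deeper trees; a fuller version would expand every position up to the ensemble’s maximum depth.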


Heatmap Tree

At each node position, we have a list of feature (or DNE or Leaf) counts. Heatmaps provide a natural way of visualizing this count information. We can turn each node of this position tree into a heatmap, giving rise to the “Heatmap Tree”:


The Heatmap Tree

To view the interactive d3.js visualization, see this link.

The Heatmap Tree shows the entire ensemble as a single unit. Each number is a feature, and the color scale tells us how many trees use that feature at the given node:

Heatmap Detail

By looking at the Heatmap Tree we can get a sense of which features the ensemble uses when it classifies instances. The ensemble uses features 7, 9, 11, 25, and 29 to perform the initial split, with 11 used most often. Further down the tree, we see features 7 and 9 again, along with other common features such as 33 and 37. We can easily see that most of the trees in the ensemble are between 5 and 7 levels deep, and that while there are 46 total features, any given node location uses at most 9 distinct features.

The Heatmap Tree can bring together hundreds of trees, such as this visualization of a 325 tree ensemble:


Tuning Parameters

The Heatmap Tree also lets us see how tuning parameters affect the final model. For instance, as we decrease the learning rate from 0.10 to 0.001, we see the ensemble size fluctuate:


Learning Rate 0.10


Learning Rate 0.01


Learning Rate 0.001

Notice how in the 0.01 case, the Heatmap Tree concisely summarizes a 111-tree ensemble.

When we use feature subsampling, we see the number of features at each node increase (in general). Each tree has a different limited subset of features to choose from, leading to a spread in the overall distribution of chosen features:


Feature subsampling 0.6

Combining feature subsampling with training data subsampling makes this ensemble even more crowded:


Feature subsampling 0.6, Data subsampling 0.6

Note that these parameter trends do not necessarily generalize. However, the Heatmap Tree captures all of the trees and features in a single structure, and gives us insight into the structural results of our parameter choices.


The Heatmap Tree, unfortunately, has its limits. With wide variation of features and many leaves, the tree becomes crowded:


Since the number of possible node locations at a given level increases exponentially with height, the tree also suffers when trying to visualize deep trees.


Another nice aspect of decision trees is that we can visualize how a test instance gets classified; we simply show the path it takes from root to leaf.

How could we visualize the classification process for an ensemble, via the Heatmap Tree or otherwise? With the Heatmap Tree, we would need to visualize tens or hundreds of paths simultaneously, since there is an individual path for every tree in the ensemble. One idea is to weight the edges of the Heatmap Tree: an edge becomes thicker each time it is used when classifying an instance.

Another next step is to test the generalizability of this visualization; could it work for any gradient boosting model? What would a Heatmap Tree of a random forest look like?


We’ve taken a close look at LambdaMART and gradient boosting, and devised a cohesive way to capture the complexity of gradient boosted tree ensembles. In doing so we won back some of the interpretability that we lost by moving from a single tree to an ensemble, gained insight into the training process, and made the black box LambdaMART model a bit more transparent.

To see an interactive d3 Heatmap Tree, visit this link.

References and Further Reading

From RankNet to LambdaRank to LambdaMART: An Overview

Gradient Boosting Machines: A Tutorial

“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis

Learning to Rank Overview

How does a search engine rank the results that it returns?

How does a question and answering system select a final answer?

How does a translation system ultimately choose a translation for some text?

In the next few posts, we will gain insight into these questions by exploring a sub-domain of machine learning called Learning to Rank. Each of the three questions has a common last step, where an algorithm takes a list of ‘candidates’ and produces a ranked list of results.

In the translation case, we give a system a string of text to translate, and it produces, say, 20 potential candidate translations. Then a Learning to Rank model steps in, ranks those 20 candidates, and returns the top candidate.

In the question-answering case, we give a system a query string, and it produces, say, 100 potential answers. Again we use a Learning to Rank model to produce a ranking of the potential answers, and select the top one as the final answer.

In the search engine case, we type in a search query, and the search engine produces, say, 1000 candidate search results. The order of the final results that you eventually see is produced by a Learning to Rank model that ranks those 1000 candidate results.

To explore the details, we’ll first define the problem addressed in the Learning to Rank domain and survey the methods, datasets, and metrics that are used. Then in the next posts, we’ll look in depth at a state of the art model called LambdaMART, and devise a way of visualizing its training process and the resulting model. Along the way we’ll also develop wrapper code for running a major learning to rank library called JForests and discuss other problems tackled by the field.

Learning to Rank

As a first step, we’ll define the problem. While learning to rank has numerous applications including machine translation, QA, and even spell checking, we’ll use search engine document ranking as a running example, mainly due to availability of datasets.

Problem definition

In the context of document retrieval, consider a search engine that accepts a query, and produces a ranking of documents related to the query using a technique such as TF-IDF scoring. Suppose we take the top 100 documents. The goal of learning to rank is to produce a better ranking of these 100 candidate documents for the given query.

Most research has posed learning to rank as a supervised learning problem, although there has been work in semi-supervised ranking. We’ll only focus on the supervised problem here. Given a list of queries and candidate documents with annotated relevance scores, the learning to rank model learns to predict the relevance of candidate documents for a new query, and derives a ranking from the predictions. Let’s define that a bit more precisely.

More precise problem definition

The data consists of queries, documents, and relevance scores. Let Q denote the set of queries, D denote the set of documents, and Y denote the set of relevance scores.

For each query q_{i} \in Q we have a list of m candidate documents D_{i}={d_{i,1},d_{i,2},...,d_{i,m}} . We want to produce a ranking for the documents D_{i}, specifically a permutation \pi_{i} of the document indices \{1,2,...,m\}. To do so, we want to learn a function f(q,D) that produces an optimal permutation \pi_{i}.

As we’ll see, this ranking is produced in different ways; for instance, some models derive the rankings by producing a score for each (q_{i},d_{i,j}) pair, while others learn to tell whether the rank of (q_{i},d_{i,j}) is higher than (q_{i},d_{i,k}).

Training Set

In the standard training set, each document d_{i,j} is assigned a relevance score y_{i,j} that says how relevant the document is to query q_{i}. These scores can be assigned by human annotators, or derived implicitly from sources such as click through data.

We can think of the training set as consisting of 3-tuples (q_{i}, d_{i,j}, y_{i,j}); one training example consists of a query, a candidate document, and a measure of how relevant that document is to the query. An example labeling is y_{i,j} \in \{1, 2, 3, 4, 5\}, where 1 means “not relevant” and 5 means “very relevant”.

For instance, we could have a query q_{1} = “Leo Tolstoy famous novel”, with candidate documents D_{1}=\{d_{1,1}=Steppenwolf, d_{1,2}=War\ and\ Peace, d_{1,3}=The\ Cossacks\}, and relevance labels y_{1,1}=1,y_{1,2}=5,y_{1,3}=3.

Each pair (q_{i}, d_{i, j}) is actually a feature vector \phi(q_{i}, d_{i, j}), so the final training set is defined as T=\{(\phi(q_{i}, d_{i, j}),y_{i,j})\} for i=1...n,\ j=1...m, where n is the number of queries, and m is the number of candidate documents for each query.

A concrete training set

To make this concrete, here’s a few training examples taken from the OHSUMED dataset:

0 qid:1 1:1.000000 2:1.000000 3:0.833333 4:0.871264 5:0 6:0 7:0 8:0.941842 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:1.000000 16:1.000000 17:1.000000 18:0.719697 19:0.729351 20:0 21:0 22:0 23:0.811565 24:1.000000 25:0.972730 26:1.000000 27:1.000000 28:0.922374 29:0.946654 30:0.938888 31:1.000000 32:1.000000 33:0.711276 34:0.722202 35:0 36:0 37:0 38:0.798002 39:1.000000 40:1.000000 41:1.000000 42:1.000000 43:0.959134 44:0.963919 45:0.971425 #docid = 244338
2 qid:1 1:0.600000 2:0.600000 3:1.000000 4:1.000000 5:0 6:0 7:0 8:1.000000 9:0.624834 10:0.767301 11:0.816099 12:0.934805 13:0.649685 14:0.680222 15:0.686762 16:0.421053 17:0.680904 18:1.000000 19:1.000000 20:0 21:0 22:0 23:1.000000 24:0.401391 25:0.938966 26:0.949446 27:0.984769 28:0.955266 29:1.000000 30:0.997786 31:0.441860 32:0.687033 33:1.000000 34:1.000000 35:0 36:0 37:0 38:1.000000 39:0.425450 40:0.975968 41:0.928785 42:0.978524 43:0.979553 44:1.000000 45:1.000000 #docid = 143821
0 qid:1 1:0.400000 2:0.400000 3:0.555555 4:0.563658 5:0 6:0 7:0 8:0.545844 9:0.380576 10:0.427356 11:0.468244 12:0.756579 13:0.366316 14:0.360838 15:0.373909 16:0.210526 17:0.479859 18:0.595237 19:0.608701 20:0 21:0 22:0 23:0.613865 24:0.184562 25:0.791539 26:0.863833 27:0.957024 28:0.896468 29:0.941132 30:0.946305 31:0.232558 32:0.507810 33:0.603068 34:0.616847 35:0 36:0 37:0 38:0.614004 39:0.202374 40:0.812801 41:0.868091 42:0.958879 43:0.926045 44:0.944576 45:0.963753 #docid = 285257


0 qid:63 1:0.142857 2:0.162076 3:0.250000 4:0.250000 5:0 6:0 7:0 8:0.167856 9:0.078385 10:0.103707 11:0.132891 12:0.419191 13:0.027399 14:0.027300 15:0.027300 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.026316 32:0.063236 33:0.444444 34:0.407079 35:0 36:0 37:0 38:0.189973 39:0.011360 40:0.068814 41:0.066431 42:0.287984 43:0.188175 44:0.000000 45:0.046177 #docid = 173315
2 qid:63 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0 6:0 7:0 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0 36:0 37:0 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.000000 44:0.000000 45:0.000000 #docid = 9897
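Each line is in the SVMlight-style LETOR format: a relevance label, a query id, numbered feature:value pairs, and a trailing docid comment. A small Python parser (the helper name is mine, not code from the post) might look like:

def parse_letor_line(line):
    """Parse one line of the LETOR/OHSUMED format:
       '<relevance> qid:<qid> 1:<v1> 2:<v2> ... #docid = <id>'
    Returns (relevance, qid, feature_dict, docid)."""
    body, _, comment = line.partition('#')
    tokens = body.split()
    relevance = int(tokens[0])
    qid = tokens[1].split(':')[1]
    features = {}
    for tok in tokens[2:]:
        idx, val = tok.split(':')
        features[int(idx)] = float(val)
    docid = comment.split('=')[1].strip() if '=' in comment else None
    return relevance, qid, features, docid

# Example: the first line above parses to relevance 0, qid '1', 45 features,
# and docid '244338'.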

We can see that the relevance scores y_{i,j} (the first number of each line) are between 0 and 2. Looking at the original OHSUMED text, we see that the first example is (y_{1,1},\phi(q_{1}, d_{1, 1})) with:

q_{1}: “Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy”

d_{1,1}= #docid 244338 = “Effects on bone of surgical menopause and estrogen therapy with or without progesterone replacement in cynomolgus monkeys.”

y_{1,1} = 0 = not relevant

and the second example is (y_{1,2},\phi(q_{1}, d_{1, 2})) with:

d_{1,2}= #docid 143821 = “Cardiovascular benefits of estrogen replacement therapy.”

y_{1,2} = 2 = very relevant

The text has been converted into the features described on pg.11 of this paper. Note that I only showed the document title here and omitted the body text.

Test Instance

At test time, we receive a query q and a list of m candidate documents D={d_{1},d_{2},...,d_{m}} . We evaluate f(q,D) and return a ranking of the documents D, specifically a permutation \pi of the document indices \{1,2,...,m\}.

Learning to Rank Datasets

There are several benchmark datasets for Learning to Rank that can be used to evaluate models.

Microsoft Research released the LETOR 3.0 and LETOR 4.0 datasets. LETOR 3.0 contains data from a 2002 crawl of .gov web pages and associated queries, as well as medical search queries and medical journal documents from the OHSUMED dataset. LETOR 4.0 contains queries from the Million Query Track from TREC and document ids from Gov2, another crawl of .gov web pages. Data is represented as raw feature vectors; information about the features used and the datasets is found here. One thing that I like about the OHSUMED data is that the original text of the queries and documents is available, which can be helpful for interpreting model output or deriving new features. Unfortunately obtaining the original GOV data is nontrivial.

Microsoft Research also released a larger-scale dataset consisting of 30,000 queries, candidate document features, and relevance judgements derived from the Bing search engine. The MSLR-WEB30K dataset contains all of the queries, while the smaller MSLR-WEB10K dataset contains a 10,000-query sample of MSLR-WEB30K. Unfortunately, we are unable to retrieve the original query text and URLs to further interpret results, but the dataset is nevertheless very useful for evaluating a model and producing metrics.

Yahoo sponsored a Learning to Rank competition in 2010 and released a dataset consisting of 36,000 queries and 883,000 document IDs. The datasets are available here, but require a university email to download. Similar to MSLR-WEB, we receive the raw features and relevance scores, and are unable to see the text of the query or URL.

Learning to Rank Evaluation Metrics

When we run a learning to rank model on a test set to predict rankings, we evaluate the performance using metrics that compare the predicted rankings to the annotated gold-standard labels. Before reviewing the popular learning to rank metrics, let’s introduce notation.

We’ll assume that there are n queries, and that for each query q_{i} we have a list of m candidates D_{i} =\{d_{i,1},...d_{i,m}\} with annotated relevance scores y_{i,1},...y_{i,m}. The ranker produces a ranking \pi_i, so that \pi_{i,j} represents the predicted ranking of document d_{i,j}.

Intuitively, we would like our ranking to have high relevance scores, with the highest scores receiving the highest rankings. Now we’ll look at several metrics that capture these intuitions.

Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) is a popular learning to rank metric that relies on Cumulative Gain and Discounted Cumulative Gain.

Cumulative Gain

First we’ll look at Cumulative Gain. Suppose we have:

\pi_i = \{d_{i,1}, d_{i,2}, d_{i,3}, d_{i,4}\}

y_{i}= \{5, 2, 5, 0\}

That is, the top ranked document has a relevance score of 5, the second ranked document has a relevance score of 2, and so on.

The cumulative gain is simply the sum of the relevance scores:

CG_k = \sum_{j=1}^{k} y_{i,\pi_{i,j}} = 5 + 2 + 5 + 0 = 12
Notice that this doesn’t take ranking order into account; if our ranking was reversed:

\pi_i = \{d_{i,4}, d_{i,3}, d_{i,2}, d_{i,1}\}

it would receive the same cumulative gain score of 12.


Discounted Cumulative Gain

Discounted Cumulative Gain (DCG) takes order into account by ‘rewarding’ high relevance scores that appear in high ranks. A high relevance score appearing at a low rank receives a lower ‘reward’. To compute the metric, we use:

DCG_k = \sum_{j=1}^{k} \frac{2^{y_{i,\pi_{i,j}}}-1}{\log_2(j+1)}=31+1.9+15.5+0=48.4

We can see that in position 1, the high relevance score 5 is worth 31, while in position 3 it’s worth 15.5. The lower relevance score 2 is only worth 1.9 in position 2. With DCG order matters; the score would increase if we swapped the rankings of d_{i,2} and d_{i,3}.

Normalized Discounted Cumulative Gain

NDCG_k is just a normalized version of DCG_k. The normalization is done by finding the DCG_k of an ‘ideal’ ranking. In our example, the ideal ranking would be:

\pi_i = \{d_{i,1}, d_{i,3}, d_{i,2}, d_{i,4}\}


DCG_{k_{ideal}}=\sum_{j=1}^{k} \frac{2^{y_{i,\pi_{i,j}}}-1}{\log_2(j+1)}=31+19.5+1.5+0=52


NDCG_k= \frac{DCG_k}{DCG_{k_{ideal}}}=\frac{48.4}{52}=0.93

We can see that if our predicted rankings were ideal, they would receive an NDCG_k score of 1.
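For reference, here is a small Python helper (the function names are mine) that reproduces the worked example:

import numpy as np

def dcg(relevances, k=None):
    """DCG_k = sum over positions j of (2^rel_j - 1) / log2(j + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    positions = np.arange(1, len(rel) + 1)
    return np.sum((2 ** rel - 1) / np.log2(positions + 1))

def ndcg(relevances, k=None):
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(dcg([5, 2, 5, 0]))    # ~48.4
print(ndcg([5, 2, 5, 0]))   # ~0.93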

Mean Average Precision (MAP)

Mean average precision (MAP) is another popular metric for learning to rank, specifically for applications with binary relevance scores, e.g. “correct” and “incorrect” in question-answering.


MAP relies on Precision at k (P@k), which counts the number of relevant (or “correct”) documents in the first k ranked positions and divides by k:

P@k = \frac{1}{k}\sum_{j=1}^{k}I(relevance_j==1)
Average precision is then defined as:

AP = \frac{\sum_{k=1}^{n}(P@k \times I(relevance_k==1))}{\sum_{k=1}^{n}I(relevance_k==1)}


Notice that AP is computed for a single query.  MAP is the mean AP value over all queries:

MAP = \frac{1}{numQueries}\sum_{q=1}^{numQueries}AP_{q}.
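A small Python sketch of these two computations (function names are mine), with relevance given as a binary label per ranked position:

def average_precision(relevances):
    """AP for one query's ranked list of binary relevance labels."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel == 1:
            hits += 1
            precision_sum += hits / k       # P@k, counted only at relevant positions
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Relevant documents at ranks 1 and 3 give AP = (1/1 + 2/3) / 2 ~ 0.83
print(average_precision([1, 0, 1, 0]))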

Learning to Rank Models

Learning to Rank models can be broadly classified into three groups: pointwise, pairwise, and listwise. We’ll introduce and provide an example of each group. There are many more learning to rank models than the ones covered here.

Pointwise Models

Pointwise models, also known as ‘scorewise’ models, create rankings by computing a real-valued score for each example \phi(q_{i},d_{i,j}), then sorting by that score.

An advantage of pointwise models is that they have an interpretable confidence score for each query-document pair. Possible disadvantages are that there is not a notion of grouping the training instances (e.g. by query), and that the prediction target is exact relevance rather than relative position.

Logistic regression, PRank, and MCRank are examples of pointwise models.

Pairwise Models

Pairwise models learn to decide which of two documents is ranked higher. The models rely on the observation that in learning to rank problems we ultimately care about the relative position of documents rather than exact distances between document scores.

Pairwise models first isolate the documents for each query, then form a training set consisting of pairs of documents, and a label for whether the first document has a higher rank than the second document. So each training example is of the form (\phi(q_{i},d_{i,j}),\phi(q_{i},d_{i,k}),\{+1,-1\}). Model-wise, we find f such that f(d_{i,j})>f(d_{i,k}) when y_{i,j}>y_{i,k}.

The problem is transformed into a binary classification problem; we train a binary classifier to minimize the number of incorrectly ordered pairs.
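As a sketch of this transformation (the helper name is mine, not code from any of the libraries mentioned), here is how one query’s candidates could be turned into pairwise binary examples:

from itertools import combinations

def pairwise_examples(features, labels):
    """Build (phi_i, phi_j, +1/-1) examples for one query:
    +1 when y_i > y_j, -1 when y_i < y_j."""
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue                        # ties carry no ordering information
        sign = 1 if labels[i] > labels[j] else -1
        pairs.append((features[i], features[j], sign))
    return pairs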

Several pairwise models exist, each taking a different approach to the problem. Examples include RankNet and SVMRank. Another model called LambdaMART will be the focus of the next post.

Listwise Models

Listwise models are trained by optimizing a loss function that measures predicted ranks versus actual ranks for a query’s entire candidate list. A loss function such as MAP or NDCG can be directly optimized by listwise models. Examples include coordinate ascent and LambdaRank.

Up Next

We’ve introduced the problem of Learning to Rank, and briefly surveyed its algorithms. In the next post, we’ll examine the LambdaMART model, and develop a way of visualizing its training process.

Sources and Further Reading

A Short Introduction to Learning to Rank

Learning to Rank QA Data

“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis

Learning to Rank Wikipedia

Scala Coursera Highlights

I recently completed the Functional Programming Principles in Scala course on Coursera. Along the way, I learned some interesting things ranging from small tidbits to major language features. I’m going to use this post as a chance to highlight and review my favorite things that I learned about Scala from the course. This is meant to be more of a quick review than a set of tutorials, so I’ll provide links to actual tutorials and further reading for each section.


One thing I liked was seeing abstract programming language concepts used concretely in Scala. An example of this is the idea of variance, namely covariance and contravariance. Covariance and contravariance appear when investigating whether one type is a subtype of another type. This question comes up concretely in several places, such as deciding whether a type is a valid function argument, or deciding which types can be held by a data structure.

Argument Parameters

Functions accept subtypes as arguments. For instance, a function requiring an Animal type will accept Cat or Dog subtypes as arguments. An interesting (and tricky) case is when a function takes a function as an argument, e.g.

def fun[T] (animal: Animal, paramFun: Animal => T) : T

What are valid arguments for fun? The animal argument is clear; it’s just any subtype of Animal. But which functions could we pass in for paramFun? This requires determining the subtype of a function.

A natural guess may be that a function f1: A => B is a subtype of a function f2: C => D when A is a subtype of C and B is a subtype of D. However, this is incorrect!

It turns out that we need to ‘reverse’ the argument subtyping; C must be a subtype of A. Specifically, we have:

f1: A => B is a subtype of f2: C => D when:

  • C is a subtype of A
  • B is a subtype of D

For instance, the function

def f1(animal: Animal) : Cat

is a subtype of

def f2(cat: Cat) : Animal

In terms of variances, we describe the above by saying that functions are contravariant in their argument types, and covariant in their return types.

The intuition for this is derived from the Liskov Substitution Principle, which is (paraphrased) the idea that a subtype should be expected to do everything that its supertypes can do. We should be able to use f1 in all of the ways that we can use f2.

To connect it back with this post’s topic, it turns out that defining and enforcing variance rules is a language feature in Scala! Scala provides the - and + type annotations for contravariance and covariance, respectively. Thus we can implement, for instance, a one parameter function class, and bake the variance rules into the type parameters:

class Function1[-V, +W] { ... }

The type tells us that Function1 is contravariant in V and covariant in W, as desired.

Type Bounds

Another type-related feature is type bounding. We can create a parameterized function and specify that its parameterized type is a subtype (or supertype) of another type, e.g:

def foo[T >: String] (t: T)

says that T must be a supertype of String, and

def foo[T >: String <: Object] (t: T)

says that T must be a supertype of String and a subtype of Object.

An Example

As an example of where we would use variance and type bounds, suppose we want to add a prepend function to the List class. Looking at Scala’s documentation, we can see that List is parameterized by a covariant type A:

sealed abstract class List[+A] ...

Now, prepend will add an element to the front of the list; a first guess at its type signature might be:

// compile error
def prepend(a: A) : List[A]

We are attempting to pass a covariant type as a function argument; since function arguments are contravariant this will result in a compile error.

To fix this, we can use a type bound to tell the compiler that we’d like to be able to pass in any supertype B of A, and produce a new list of B’s:

// compiles
def prepend[B >: A] (b: B) : List[B]

For Comprehensions

The functions map, flatMap, and filter are used commonly in functional programming. A common pattern is to chain these functions together, which can lead to bloated and unreadable (albeit effective) code. A simple example:

(1 until 10).flatMap(x => 
   (1 until 20).filter(y => y >= 10)
   .map(y => (x,y)))

Scala provides for comprehensions as a way of chaining map, flatMap, and filter operations together. Broadly stated, for comprehensions are syntactic sugar that translate the ‘arrow’, ‘if’, and ‘yield’ syntax into calls to flatMap, withFilter, and map, leading to more concise and readable code:

for {
  x <- (1 until 10)
  y <- (1 until 20)
  if y >= 10 
} yield (x, y)

Any type that supports map, flatMap, and withFilter can be used in a for comprehension, such as List, Future, Map, and many more. For those familiar with Haskell, Scala’s for comprehensions are analogous to Haskell’s do notation, which eliminates the clutter of explicitly writing binds.

Call by Optional

By default, Scala functions are call by value, meaning that function arguments are reduced prior to evaluating the function body. However, Scala lets you override this default by adding => before a parameter’s type:

// x is call by value
// y is call by value
def fun1(x: Int, y: Int)

// x is call by value
// y is call by name
def fun2(x: Int, y: => Int)

The y parameter in fun2 will be call by name, meaning that it will only be evaluated if and when it’s used in the function body.

While this is a relatively minor feature, it speaks to a general theme that I’ve noticed with Scala: the language provides you with many tools, opening up multiple ways to solve a problem. While this flexibility can be misused or confusing, so far I’ve found that it contributes to Scala being a great practical language. Code can be structured using functional ideas, while still incorporating object-oriented ideas and allowing access to the extensive ecosystem of Java libraries.

For more details on these areas, see the Scala documentation on variance, type bounds, and for comprehensions; the course also pointed to some good general Scala resources.

LDAOverflow with Online LDA

In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets, and visualized them. In this post, we’ll investigate using LDA on an 8gb dataset of around 8 million Stack Overflow posts.

We’ll need to take a different approach; for the tweets we used a batch algorithm that worked well for the relatively small dataset of around 5000 tweets, but that would likely run into performance issues on massive datasets. The batch algorithm also assumed that we have the entire training set at the start of training, making the approach unviable for streaming data that may arrive during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training when we receive more data without having to retrain on the original chunk.

In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.

First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.

Variations on Variational Bayes

For brevity this part will assume that you’ve read through the math background in the first LDA post. I’ll also only highlight major parts; for the full story check out Hoffman’s online LDA paper.

In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.

In short, we approximate the true distribution with a simpler distribution q, and associate parameters \phi,\ \gamma,\ \lambda with the original variables z,\ \theta,\ \beta respectively. Recall that z gives the topic assignments for each word in each document, \theta gives the topic composition of each document, and \beta gives the word-topic probabilities for each word and each topic.

Specifically, we have:

q(z_{di}=k) = \phi_{dw_{di}k}

q(\theta_{d}) = dirichlet(\theta_{d}; \gamma_{d})

q(\beta_{k}) = dirichlet(\beta_{k}; \lambda_{k})

Our goal is to estimate \phi,\ \gamma,\ \lambda. In both batch and online LDA, we alternate between two steps:

1. E-Step: Estimate \phi,\ \gamma using the current value of \lambda

2. M-Step: Update \lambda , using the current value of \phi

The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.

Starting with Batch

In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:

for d = 1 to numDocs
    initialize \gamma
    repeat until change in \phi < \epsilon
        update \phi
        update \gamma

Then the M-Step updates \lambda using \phi values from every document:

update \lambda

The specific updates are:

\phi_{d,word_{i},k} \propto e^{E_{q}(log \theta_{d,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{d,k} = \alpha + \sum_{word_{i}}\phi_{d,word_{i},k}n_{d,word_{i}}

\lambda_{k,word_{i}}=\eta +\sum_{d}n_{d,word_{i}}\phi_{d,word_{i},k}

Where n_{d,word_{i}} is the number of occurrences of word_{i} in document d.

Going Online

In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.

The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for \phi and \gamma:

initialize \gamma
repeat until change in \phi < \epsilon
    update \phi
    update \gamma

In the M-Step, we first compute \lambda', which is the value of \lambda if we imagined that the entire dataset is made up of \frac{numDocs}{chunkSize} copies of the current chunk. Then \lambda is updated using a weighted sum of \lambda' and \lambda:

compute \lambda'
update \lambda

The specific updates are:

\phi_{iter,word_{i},k} \propto e^{E_{q}(log \theta_{iter,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{iter,k} = \alpha + \sum_{word_{i}}\phi_{iter,word_{i},k}n_{iter,word_{i}}

\lambda'_{k,word_{i}}=\eta +\frac{numDocs}{batchSize}n_{iter,word_{i}}\phi_{iter,word_{i},k}

\lambda = (1-\rho_t)\lambda+\rho_t\lambda'

Where n_{iter,word_{i}} is the number of occurrences of word_{i} in the current iteration’s chunk of documents, and \rho_t is a weighting parameter.

We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated \lambda, we can find the most probable words for each topic by looking at the word probabilities in each row of \lambda.
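To make the online loop concrete, here is a condensed Python sketch following the updates in Hoffman’s paper; the function names, the fixed number of inner iterations, and the step-size schedule \rho_t = (\tau_0 + t)^{-\kappa} (the schedule used in that paper) are illustrative rather than a drop-in replacement for his implementation. Each document is assumed to be a dict of {word_id: count}.

import numpy as np
from scipy.special import digamma

def e_step(batch, lam, alpha, iters=50):
    """Local variational step: iterate the phi/gamma updates for each document.
    Returns per-document gammas and sufficient statistics sum_d n_dw * phi_dwk."""
    K, V = lam.shape
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    sstats = np.zeros((K, V))
    gammas = []
    for doc in batch:                                  # doc: {word_id: count}
        ids = np.array(list(doc.keys()))
        counts = np.array(list(doc.values()), dtype=float)
        gamma = np.ones(K)
        for _ in range(iters):                         # fixed iterations instead of a convergence check
            Elog_theta = digamma(gamma) - digamma(gamma.sum())
            # phi_dwk proportional to exp(E[log theta_dk] + E[log beta_kw])
            log_phi = Elog_theta[:, None] + Elog_beta[:, ids]
            phi = np.exp(log_phi - log_phi.max(axis=0))
            phi /= phi.sum(axis=0)
            gamma = alpha + (phi * counts).sum(axis=1)
        sstats[:, ids] += phi * counts
        gammas.append(gamma)
    return np.array(gammas), sstats

def online_lda(chunks, lam, alpha, eta, num_docs, tau0=1.0, kappa=0.7):
    """Single sweep of online variational Bayes over an iterable of mini-batches."""
    for t, batch in enumerate(chunks):
        rho_t = (tau0 + t) ** (-kappa)                 # step-size schedule
        _, sstats = e_step(batch, lam, alpha)
        # lambda': pretend the corpus is numDocs/batchSize copies of this batch
        lam_prime = eta + (num_docs / len(batch)) * sstats
        lam = (1 - rho_t) * lam + rho_t * lam_prime
    return lam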

Intuitions of the Inference

If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are \gamma and \lambda. How do these get updated during training?

Updates of \gamma (the topic compositions for each document) are the prior \alpha plus a weighted sum of word counts. The word counts are weighted by \phi_{word,topic}, the probability of assigning the word to the topic. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.

Updates of \lambda_{topic,word} (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset and is likely to be assigned to topic 2, then it makes sense that \lambda_{2, potato} should increase.

LDA Overflow

Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user generated content; I chose the Stack Overflow posts dump.

Read, Clean, Parse, Repeat

The data arrives as a 27gb XML behemoth. The first step is isolating the text from the Title and Body fields for each row. These fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.

Since the file is so large, we need to read the XML incrementally. We also filter out non-alphanumeric characters; the details of this preprocessing step can be found in the repository.

Once the preprocessing script runs, we get an 8gb text file containing around 8,000,000 Stack Overflow documents (title and body content). A couple of examples:

Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table

Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary

LDA by Hoffman

We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application, so I created a wrapper script for running LDA; passing -h to the wrapper prints the various command line arguments.

I’ve also added more comments to the online LDA implementation in the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.

Building a Vocabulary

The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows the algorithm to associate an index of \beta with a word ID, and hence with a word.

We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words and would probably not appear in the top words for a topic anyway, given how infrequently they appear. Code for the vocabulary generation is found in the repository.
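Roughly, such a vocabulary could be built with gensim’s Dictionary as sketched below; the file paths, the 100,000-document cap, and the 10-document cutoff mirror the description above, but the helper itself is an illustration rather than the code in the repo, and merging in Hoffman’s English vocabulary is omitted.

from gensim.corpora import Dictionary

def build_vocabulary(documents_path, out_path, max_docs=100000, min_docs=10):
    """Build a vocabulary from the first max_docs documents (one per line),
    keeping only words that appear in at least min_docs documents."""
    def tokenized_docs():
        with open(documents_path) as f:
            for i, line in enumerate(f):
                if i >= max_docs:
                    break
                yield line.lower().split()

    vocab = Dictionary(tokenized_docs())
    vocab.filter_extremes(no_below=min_docs, no_above=1.0, keep_n=None)
    with open(out_path, 'w') as out:
        for word in sorted(vocab.token2id):
            out.write(word + '\n')

# build_vocabulary('dataset.txt', 'vocabulary.txt')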


Let’s kick it off!

python dataset.txt vocabulary.txt

The training took ~12 hours for a 100 topic model on my MacBook. The values of \gamma and \lambda are output to files every 10,000 iterations and when the training completes. We can then use one of the \lambda files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:

python vocabulary.txt lambda-final.dat 2


The output lists the top 20 words for topic 0 and topic 1; the number attached to each word is its word-topic probability from \lambda.

We’ll use the approach from the first LDA post to create word clouds for two different topics:



We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!

Credits & Links

Much of the material is derived from Hoffman’s paper and his online LDA implementation. Gensim and Vowpal Wabbit also have implementations of online LDA.

These Are Your Tweets on LDA (Part II)

In the last post, I gave an overview of Latent Dirichlet Allocation (LDA), and walked through an application of LDA on @BarackObama’s tweets. The final product was a set of word clouds, one per topic, that showed the weighted words that defined the topic.

In this post, we’ll develop a dynamic visualization that incorporates multiple topics, allowing us to gain a high-level view of the topics and also drill down to see the words that define each topic. Through a simple web interface, we’ll also be able to view data for different Twitter users.

Click here for an example of the finished product.

As before, all of the code is available on GitHub. The visualization-related code is found in the viz/static directory.

Harnessing the Data

In the last post, we downloaded tweets for a user and found 50 topics that occur in the user’s tweets along with the top 20 words for each topic. We also found the composition of topics across all of the tweets, allowing us to rank the topics by prominence. For our visualization, we’ll choose to display the 10 highest ranked topics for a given twitter user name.

We need a visualization that can show multiple groupings of data. Each of the 10 groupings has 20 words, so we’d also like one that avoids the potential information overload. Finally, we’d like to incorporate the frequencies that we have for each word.


A good fit for these requirements is d3.js‘s Zoomable Pack Layout, which gives us a high level view of each grouping as a bubble. Upon clicking a bubble, we can see the data that comprises the bubble, as well as each data point’s relative weight:


d3 to the rescue

d3 Zoomable Pack Layout

In our case, each top-level bubble is a topic, and each inner bubble is a word, with its relative size determined by the word’s frequency.

Since the d3 visualization takes JSON as input, in order to plug in our LDA output we simply create a toJSON() method that writes the data associated with the top 10 topics to a JSON file. The ‘name’ of each topic is simply the most probable word in the topic.

Now, when the LDA process (the main() method) is run for a twitter user, the code will create a corresponding JSON file in viz/json. The JSON structure looks like:

     "name": {topic_1_name},
        "name": {topic_1_word_1},
        "size": {topic_1_word_1_freq}
        "name": {topic_1_word_2},
        "size": {topic_1_word_2_freq}
        "name": {topic_1_word_3},
        "size": {topic_1_word_3_freq}


Now, we make slight modifications to the javascript code embedded in the given d3 visualization. Our goal is to be able to toggle between results for different twitter users; we’d like to switch from investigating the @nytimes topics to getting a sense of what @KingJames tweets about.

To do so, we add a drop-down to index.html, such that each time a user is selected on the drop-down, their corresponding JSON is loaded by the show() function in viz.js. Hence we also change the show() function to reload the visualization each time it is called.

Making The Visualizations Visible

To run the code locally, navigate to the viz/static directory and start an HTTP server to serve the content, e.g.

cd {project_root}/viz/static
python -m SimpleHTTPServer

then navigate to http://localhost:8000/index.html to see the visualization.

By selecting nytimes, we see the following visualization which gives a sense of the topics:

@nytimes topics

Upon clicking the ‘gaza’ topic, we see the top words that comprise the topic:

'gaza' topic

I’ve also used Heroku to host an example of the finished visualization with data from 10 different twitter usernames; see the link at the top of this post.

Have fun exploring the various topics!

These Are Your Tweets on LDA (Part I)

How can we get a sense of what someone tweets about? One way would be to identify themes, or topics, that tend to occur in a user’s tweets. Perhaps we can look through the user’s profile, continually scrolling down and getting a feel for the different topics that they tweet about.

But what if we could use machine learning to discover topics automatically, to measure how much each topic occurs, and even tell us the words that make up the topic?

In this post, we’ll do just that. We’ll retrieve users’ tweets, and use an unsupervised machine learning technique called Latent Dirichlet Allocation (LDA) to uncover topics within the tweets. Then we’ll create visualizations for the topics based on the words that define them. Our tools will be Java, Twitter4J, and Mallet. All of the code is available on GitHub for reference.

As a sneak preview, here’s a visualization of a topic from @gvanrossum:


First I’ll give an intuitive background of LDA, then explain some of the underlying math, and finally move to the code and applications.

What’s LDA?

Intuitively, Latent Dirichlet Allocation provides a thematic summary of a set of documents (in our case, a set of tweets). It gives this summary by discovering ‘topics’, and telling us the proportion of each topic found in a document.

To do so, LDA attempts to model how a document was ‘generated’ by assuming that a document is a mixture of different topics, and assuming that each word is ‘generated’ by one of the topics.

As a simple example, consider the following tweets:

(1) Fruits and vegetables are healthy.

(2) I like apples, oranges, and avocados. I don’t like the flu or colds.

Let’s remove stop words, giving:

(1) fruits vegetables healthy

(2) apples oranges avocados flu colds

We’ll let k denote the number of topics that we think these tweets are generated from. Let’s say there are k = 2 topics. Note that there are V = 8 words in our corpus. LDA would tell us that:

Topic 1 = Fruits, Vegetables, Apples, Oranges, Avocados
Topic 2 = Healthy, Flu, Colds

And that:

Tweet 1 = (2/3) Topic 1, (1/3) Topic 2
Tweet 2 = (3/5) Topic 1, (2/5) Topic 2

We can conclude that there’s a food topic and a health topic, see words that define those topics, and view the topic composition of each tweet.

Each topic in LDA is a probability distribution over the words. In our case, LDA would give k = 2 distributions of size V = 8. Each item of the distribution corresponds to a word in the vocabulary. For instance, let’s call one of these distributions \beta_{1}. It might look something like:

\beta_{1} = [0.4, 0.2, 0.15, 0.05, 0.05, 0.05, 0.05, 0.05]

\beta_{1} lets us answer questions such as: given that our topic is Topic #1 (‘Food’), what is the probability of generating word #1 (‘Fruits’)?

Now, I’ll jump into the math underlying LDA to explore specifically what LDA does and how it works. If you still need some more intuition-building, see Edwin Chen’s great blog post. Or feel free to skip to the application if you’d like, but I’d encourage you to read on!

A Bit More Formal

LDA assumes that documents (assumed to be bags of words) are generated by a mixture of topics (distributions over words). We define the following variables and notation:

k is the number of topics.

V is the number of unique words in the vocabulary.

\theta  is the topic distribution (of length k ) for a document, drawn from a uniform Dirichlet distribution with parameter \alpha .

z_{n} is a topic 'assignment' for word w_{n}, sampled from p(z_{n} = i|\theta) = \theta_{i}.

\textbf{w} = (w_{1}, ... , w_{N}) is a document with N words.

w_{n}^{i} = 1 means that the word w_{n} is the i'th word of the vocabulary.

\beta  is a k \times V matrix, where each row \beta_{i}  is the multinomial distribution for the ith topic. That is, \beta_{ij} = p(w^{j} = 1 | z_{j} = i).

LDA then posits that a document is generated according to the following process:

1. Fix k and N.

2. Sample a topic distribution \theta from Dir(\alpha_{1}, ... , \alpha_{k}) .

\theta defines the topic mixture of the document, so intuitively \theta_{i} is the degree to which topic_{i} appears in the document.

3. For each word index n \in \left\{1,..., N\right\}:

4. Draw z_{n} from \theta.

z_{n} = i tells us that the word we are about to generate will be generated by topic i.

5. Draw a word w_{n} from \beta_{z_{n}}.

In other words, we choose the row of \beta based on our value of z_{n} from (4), then sample from the distribution that this row defines. Going back to our example, if we drew the “Food” row in step (4), then it’s more likely that we’ll generate “Fruits” than “Flu” in step (5).

We can see that this does in fact generate a document based on the topic mixture \theta , the topic-word assignments z, and the probability matrix \beta .
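To make the generative story concrete, here is a small numpy sketch of steps 1–5 (the function name is mine); given \beta and \alpha it produces the words, the per-word topic assignments z, and the topic mixture \theta of one document:

import numpy as np

def generate_document(beta, alpha, num_words, seed=0):
    """Generate one document following LDA's generative process.
    beta  : k x V matrix of word probabilities per topic
    alpha : length-k Dirichlet parameter"""
    rng = np.random.default_rng(seed)
    k, V = beta.shape
    theta = rng.dirichlet(alpha)           # step 2: topic mixture for this document
    words, assignments = [], []
    for _ in range(num_words):             # step 3: for each word position
        z = rng.choice(k, p=theta)         # step 4: draw a topic z_n from theta
        w = rng.choice(V, p=beta[z])       # step 5: draw a word w_n from beta_{z_n}
        assignments.append(z)
        words.append(w)
    return words, assignments, theta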

However, we observe the document, and must infer the latent topic mixture and topic-word assignments. Hence LDA aims to infer:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) .

Coupling Problems

The story’s not over quite yet, though. We have:

p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) = \frac{p(\theta, \textbf{z}, \textbf{w} | \alpha, \beta)}{p(\textbf{w} | \alpha, \beta)}

Let’s consider the denominator. I’m going to skip the derivation here (see pg. 5 of Reed for the full story), but we have:

p(\textbf{w} | \alpha, \beta) = \frac{\Gamma(\sum_{i=1}^{k}\alpha_{i})}{\prod_{i=1}^{k}\Gamma(\alpha_{i})} \int (\prod_{i=1}^{k}\theta_{i}^{\alpha_{i}-1})(\prod_{n=1}^{N}\sum_{i=1}^{k}\prod_{j=1}^{V}(\theta_{i}\beta_{ij})^{w_{n}^{j}})\,d\theta

We cannot separate the \theta and \beta, so computing this term is intractable; we must find another approach to infer the hidden variables.

A Simpler Problem

A workaround is to approximate the intractable distribution with a simpler distribution whose parameters we can optimize, which amounts to maximizing a lower bound on the likelihood. We simplify the problem to:

q(\theta, \textbf{z}|\gamma, \phi) = q(\theta|\gamma)\prod_{n=1}^{N}q(z_{n}|\phi_{n})

And minimize the KL-Divergence between this distribution and the actual distribution p(\theta, \textbf{z} |\textbf{w}, \alpha, \beta) , resulting in the problem:

(\gamma^{*}, \phi^{*}) = \arg\min_{\gamma, \phi} D_{KL}(q(\theta, z|\gamma, \phi)||p(\theta, z|w, \alpha, \beta))

Since we also do not know \beta and \alpha, we use Expectation Maximization (EM) to alternate between estimating \beta and \alpha using our current estimates of \gamma and \phi, and estimating \gamma and \phi using our current estimates of \beta and \alpha.

More specifically, in the E-step, we solve for (\gamma^{*}, \phi^{*}), and in the M-step, we perform the updates:

\beta_{ij} \propto\sum_{d=1}^{M}\sum_{n=1}^{N_{d}}\phi^{*}_{dni}w^{j}_{dn}


\log(\alpha^{t+1}) = \log(\alpha^{t}) - \frac{\frac{dL}{d\alpha}}{\frac{d^{2}L}{d\alpha^{2}}\alpha+\frac{dL}{d\alpha}}
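For reference, the E-step that produces (\gamma^{*}, \phi^{*}) is itself an iterative coordinate ascent; the updates (as given in the original LDA paper and in Reed’s tutorial, where \Psi is the digamma function) are repeated until convergence:

\phi_{ni} \propto \beta_{iw_{n}}e^{\Psi(\gamma_{i})}

\gamma_{i} = \alpha_{i} + \sum_{n=1}^{N}\phi_{ni}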

I’ve glossed over this part a bit, but the takeaway is that we must compute a lower bound of the actual distribution, and we use EM to do so since we have two sets of unknown parameters. And in the end, we end up with estimates of \theta, \textbf{z}, \beta as desired.

For more in depth coverage, see Reed’s LDA Tutorial or the original LDA paper.

Now, let’s move to the code.

The Plan

We’ll use @BarackObama as the running example. First we’ll download @BarackObama’s tweets, which will be our corpus, with each tweet representing a ‘document’.

Then, we’ll run LDA on the corpus in order to discover 50 topics and the top 20 words associated with each topic. Next, we will infer the topic distribution over the entire set of tweets. Hence we’ll be able to see topics and the degree to which they appear in @BarackObama’s tweets.

Finally, we’ll visualize the top 20 words for a given topic based on the relative frequency of the words.

I’ll walk through the important parts of the code, but I’ll skip the details in the interest of brevity. For the whole picture, check out the code on GitHub; running the main() method will run the entire process and produce results similar to those shown below.

Getting the Tweets

To retrieve the tweets, we’ll rely on the Twitter4J library, an unofficial Java library for the Twitter API. The repository contains a wrapper class that helps out with the things we need. The method that does the ‘work’ is

public List<String> getUserTweetsText(String username, int n)

which retrieves the last n of a user’s tweets and returns them as a List of Strings. In this method we access 1 page (200 tweets) at a time, so the main loop has n/200 iterations.
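
For a concrete picture, here’s roughly what that paging loop might look like inside TwitterClient using Twitter4J. This is a sketch, not the repo’s exact code; it assumes OAuth credentials are configured in twitter4j.properties and skips rate-limit handling:

import java.util.ArrayList;
import java.util.List;
import twitter4j.Paging;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public List<String> getUserTweetsText(String username, int n) {
    List<String> tweets = new ArrayList<String>();
    Twitter twitter = TwitterFactory.getSingleton();
    try {
        for (int page = 1; page <= n / 200; page++) {
            Paging paging = new Paging(page, 200);            // one 200-tweet page per request
            for (Status status : twitter.getUserTimeline(username, paging)) {
                tweets.add(status.getText());                 // keep just the tweet text
            }
        }
    } catch (TwitterException e) {
        e.printStackTrace();                                  // real code would handle rate limits, retries, etc.
    }
    return tweets;
}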

The highest-level method is

public void downloadTweetsFromUser(String username, int numTweets)

which calls getUserTweetsText() and saves the output to files. For organization’s sake, it saves the user’s tweets to

  • ./data/{username}/{username}_tweets.txt, which contains one tweet per line
  • ./data/{username}/{username}_tweets_single.txt, which contains all of the tweets on a single line. This second file will be used later to infer the topic distribution over the user’s entire set of tweets.

Hence we can download 3000 of @BarackObama’s tweets like so:

TwitterClient tc = new TwitterClient();
tc.downloadTweetsFromUser("BarackObama", 3000);

Hammering Out Some Topics

Now it’s time to take a Mallet to the tweets in order to mine some topics. Mallet is a powerful library for text-based machine learning; we’ll use its topic-modeling tools through its Java API to load and clean the tweet data, train an LDA model, and output the results. In order to use the Mallet API, you’ll have to follow the download instructions and build a jar file, or get it from this post’s GitHub repo.

The Mallet-related code that I’ll discuss next is found in the post’s GitHub repo.

Loading the Data

The first step is to prepare and load the data. Our goal is to get the data in the {username}_tweets.txt file into an InstanceList object, i.e. a form that can be used by Mallet models.

To do so, we first create a series of “Pipes” to feed the data through. The idea is that each Pipe performs some transformation of the data and feeds it to the next Pipe. In our case, in

static ArrayList<Pipe> makePipeList()

we create a series of Pipes that will lowercase and tokenize the tweets, remove stop words, and convert the tweets to a sequence of features.
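
For reference, a pipe list along these lines can be built from Mallet’s standard pipes. This is a sketch modeled on Mallet’s importing guide, meant to live in the repo’s Mallet helper class; the token regex and stoplist path are assumptions, and the repo’s makePipeList() may differ:

import java.io.File;
import java.util.ArrayList;
import java.util.regex.Pattern;
import cc.mallet.pipe.*;

static ArrayList<Pipe> makePipeList() {
    ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
    pipeList.add(new CharSequenceLowercase());                             // lowercase the raw tweet text
    pipeList.add(new CharSequence2TokenSequence(
            Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));              // tokenize: letters with internal punctuation
    pipeList.add(new TokenSequenceRemoveStopwords(
            new File("stoplists/en.txt"), "UTF-8", false, false, false));  // drop common English stop words
    pipeList.add(new TokenSequence2FeatureSequence());                     // map tokens to feature ids
    return pipeList;
}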

Then, in

static InstanceList fileToInstanceList(String filename)

we iterate through the input file, and use our Pipe list to modify and prepare the data, returning the InstanceList that we set out to build.
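
A minimal sketch of that method, assuming one tweet per line and no labels (the repo’s version may use a different iterator or naming scheme):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

static InstanceList fileToInstanceList(String filename) throws Exception {
    // The InstanceList feeds every added Instance through the full pipe sequence.
    InstanceList instances = new InstanceList(new SerialPipes(makePipeList()));
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(filename), "UTF-8"));
    String line;
    int lineNum = 0;
    while ((line = reader.readLine()) != null) {
        // Instance(data, target, name, source): each tweet becomes one unlabeled document
        instances.addThruPipe(new Instance(line, null, "tweet:" + (lineNum++), null));
    }
    reader.close();
    return instances;
}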

Training Time

It’s training time. Using

public static ParallelTopicModel trainModel(InstanceList instances,
                                      int numTopics, int numIters,
                                      double alphaT, double betaW)

we train Mallet’s ParallelTopicModel, an implementation of LDA, on the InstanceList data. The betaW parameter is a uniform prior for \beta , and the alphaT parameter is the sum of the components of \alpha ; recall from the math section that \beta is the numtopics \times vocabsize matrix that gives word probabilities given a topic, and \alpha is the parameter of the Dirichlet distribution over topics.
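
In rough form, trainModel just wraps Mallet’s standard training calls. Here’s a sketch; the thread count is an arbitrary choice, and the repo’s version may differ:

import java.io.IOException;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

public static ParallelTopicModel trainModel(InstanceList instances,
                                      int numTopics, int numIters,
                                      double alphaT, double betaW) throws IOException {
    // alphaT plays the role of Mallet's alphaSum; betaW is the uniform prior on beta.
    ParallelTopicModel model = new ParallelTopicModel(numTopics, alphaT, betaW);
    model.addInstances(instances);
    model.setNumThreads(2);           // Gibbs sampling can run across threads
    model.setNumIterations(numIters);
    model.estimate();                 // run the sampler
    return model;
}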

Looking at the Output, Part I

With a trained model, we can now look at the words that make up the topics using \beta , and the composition \theta_i for tweet_i .

Printed to stdout, we see the 50 topics, with the top 20 words for each topic. The number attached to each word is the number of occurrences:

Topic 0 
families:13 republican:11 read:7 fast:5 ed:5 op:5 family:5 marine:4 democrat:4 efforts:4 policies:4 story:4 pendleton:3 camp:3 explains:3 military:3 #cir:3 california:3 #familiessucceed:3 workplace:3
Topic 1 
president:175 obama:165 middle:61 class:59 jobs:47 #abetterbargain:37 economy:37 good:32 growing:20 #opportunityforall:20 isn:19 #rebuildamerica:18 today:18 create:17 watch:17 infrastructure:16 americans:16 plan:15 live:15 american:15

Topic 48 
president:72 address:62 obama:57 watch:56 weekly:49 opportunity:15 economic:15 discusses:14 issue:11 importance:10 working:10 week:10 speak:9 budget:8 discuss:8 congress:7 calls:6 building:6 #opportunityforall:6 lady:5
Topic 49 
discrimination:19 lgbt:17 rights:15 #enda:15 americans:14 law:10 act:10 today:10 thedreamisnow:7 screening:7 voting:7 protections:7 basic:7 stand:7 add:7 american:7 anniversary:6 workplace:6 workers:6 support:6

Each box corresponds to taking a row of \beta , finding the indices with the 20 highest probabilities, and choosing the words that correspond to these indices.
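
In Mallet terms, that step might look like the following sketch, which prints word:count pairs in the format shown above (not necessarily the repo’s exact printing code):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.TreeSet;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;
import cc.mallet.types.InstanceList;

static void printTopWords(ParallelTopicModel model, InstanceList instances, int numWords) {
    Alphabet alphabet = instances.getDataAlphabet();                 // maps feature ids back to word strings
    ArrayList<TreeSet<IDSorter>> sortedWords = model.getSortedWords();
    for (int topic = 0; topic < sortedWords.size(); topic++) {
        StringBuilder line = new StringBuilder();
        Iterator<IDSorter> it = sortedWords.get(topic).iterator();
        for (int rank = 0; rank < numWords && it.hasNext(); rank++) {
            IDSorter wordCount = it.next();
            // word:count, as in the topic listings above
            line.append(alphabet.lookupObject(wordCount.getID()))
                .append(":").append((int) wordCount.getWeight()).append(" ");
        }
        System.out.println("Topic " + topic);
        System.out.println(line.toString().trim());
    }
}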

Since each tweet is a document, the model also contains the topic distribution for each tweet. However, our goal is to get a sense of the overall topic distribution for all of the user’s tweets, which will require an additional step. In other words, we’d like to see a summary of the major topics that the user tends to tweet about.

Getting An Overall Picture

To do so, we will use our trained model to infer the distribution over all of the user’s tweets. We create a single Instance containing all of the tweets using

singleLineFile = {username}_tweets_single.txt

and find the topic distribution:

double[] dist = inferTopicDistribution(model, instance); // 'instance' holds all of the user's tweets, built from {username}_tweets_single.txt
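
Under the hood, inference for a new document can be done with the trained model’s TopicInferencer. Here’s a sketch of what inferTopicDistribution might look like (the sampling parameters are illustrative guesses, not the repo’s exact values):

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.topics.TopicInferencer;
import cc.mallet.types.Instance;

static double[] inferTopicDistribution(ParallelTopicModel model, Instance instance) {
    TopicInferencer inferencer = model.getInferencer();
    // getSampledDistribution(instance, numIterations, thinning, burnIn)
    return inferencer.getSampledDistribution(instance, 100, 10, 10);
}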
We can then look at the distribution in  ./data/{username}/{username}_composition.txt:

0 0.006840685214902167
1 0.048881207831654686
29 0.09993216645340489
48 0.022192334924649955
49 0.01473112438501846

We see, for instance, that topic 29 is more prominent than topic 0; specifically, the model inferred that more words were generated by topic 29 than topic 0.

In ./data/{username}/{username}_ranked.txt we have the top 10 topics, ranked by composition, along with each topic’s top words. For instance, at the top of the file is:

Topic 29:

This topic could probably be labeled as “presidential”; a topic we’d expect to find near the top for @BarackObama.

Looking on, we see a topic that is clearly about healthcare:

Topic 24

and one about climate change:

Topic 28

The inferred topics are pretty amazing; a job well done by LDA. But while viewing words and their frequencies may be fun, let’s visualize a topic in a nicer way.

Into the Clouds

We now have the words that make up the most prominent topics, along with the frequency of each word. A natural visualization for a topic is a word cloud, which allows us to easily see the words and their relative weights.

It turns out that a word-cloud generator named Wordle can create word clouds given a list of weighted words…exactly the format found in ./data/{username}/{username}_ranked.txt !

Let’s copy the word:frequency list for Topic 29 and throw it into Wordle (dialing down obama and president to 200):

and the results:

Topic 29 – “Presidential”

Topic 24 – “Healthcare”

Topic 28 – “Climate Change”

A Brief Pause

Let’s pause for a second. This is a nice moment and a beautiful result.

Take a look at the words: LDA inferred a relationship between “country”, “america”, and “obama”. It grouped “insurance”, “#obamacare”, and “health”. It discovered a link between “climate”, “deniers”, and “#sciencesaysso”.

Glance back up at the math section. We never told it about presidents, countries, or healthcare. Nowhere in there is there a hard-coded link between climate words and science hash tags. In fact, it didn’t even know it would be dealing with words or tweets.

It’s ‘just’ an optimization problem, but when applied it can discover complex relationships that have real meaning. This is an aspect that I personally find fascinating about LDA and, more generally, about machine learning.

As a next step, feel free to download the code and try out other usernames to get a sense of LDA’s generality; it’s not limited to @BarackObama, climate, or healthcare.

Next Steps: From One to Many

Currently, we have a nice way of viewing the content of one topic, in isolation. In the next post, we’ll develop a visualization using d3.js for all of the top topics at once. We’ll be able to see and compare topics side by side, and obtain a higher level view of the overall topic distribution. As a sneak preview, here’s a visualization of the top 10 topics from @nytimes:

@nytimes top 10

Credits and Links

Much of the mathematical content is derived from Reed’s Tutorial and the LDA Paper (or the shorter version). These are also great resources for learning more. Edwin Chen’s blog post also provides a good introduction and an intuition-building example. The Mallet developer’s guide and data importing guide provide good examples of using the Mallet API. The Programming Historian has a great intro to using Mallet for LDA from the command line.