# Evolving Networks

Finding neural network topologies is a problem with a rich history in evolutionary computing, or neuroevolution. This post will revisit some of the key ideas and outgoing research paths. Code related to this post is found here: [code link].

## NEAT

In their 2002 paper, Kenneth Stanley & Risto Miikkulainen proposed the foundational algorithm NeuroEvolution of Augmenting Topologies (NEAT). I’ll focus on this algorithm as a starting point; for earlier developments please see Section 5.2 of this great review, Schaffer’s 1992 review, and Yao’s 1999 review. The NEAT paper introduces the ideas clearly and there are other great NEAT overviews, so to change it up I will try to present the algorithm with generic notation, which is perhaps useful for thinking about how to modify the algorithm or apply it to a new problem setting.

I’ve also made an implementation [code link] contained in a single python file; you might find it useful to see the entire algorithm in one place, or as a comparison if you also implement NEAT as an exercise. For a more robust implementation, see NEAT-Python (which the code is based on) and its extension PyTorch-NEAT.

#### Problem

NEAT addresses the problem of finding a computation graph $G = (V, E)$. Each node $v\in V$ has a bias, activation, and aggregation function, written $(b, a, \text{agg})$, and each edge $e\in E$ has a source and destination, a weight, and may be active or inactive, written  $(u, v, w, \text{active})$.

Searching through the space of these graphs amounts to searching through a space of neural networks. NEAT conducts this search using a few generic neuroevolution concepts, which I’ll focus on below, and often implements them with design decisions that can be relaxed or modified for different problems.

#### High-Level Method

NEAT iteratively produces a set of candidates $P=\{G_1,\ldots,G_N\}$, using a candidate partitioning $S=\{S_1,\ldots,S_M\}$ where $S_j \subseteq P$ and a given function $f:\mathcal{G}\rightarrow\mathbb{R}$ which measures a candidate’s quality. The candidates, partitions, and quality function are known as ‘population’, ‘species’, and ‘fitness’, respectively.

The candidate set (“population”, rectangle) contains partitions (“species”, circles), each containing candidate graphs (diamonds).

Each NEAT iteration returns a new candidate set and new partitioning, denoted as $E(P^{(i)}, S^{(i)}, f)\rightarrow P^{(i+1)},S^{(i+1)}$. Intuitively E is an ‘evolution step’ that produces a new ‘generation’. NEAT’s goal is to eventually output a ‘good’ candidate set $P^{(i)}$. Typically good means that the best candidate has quality exceeding a goal threshold, $\max_j f(G^{(i)}_j) > \tau$. We then use this high-performing neural network on a task.

### Evolution Steps

Each evolution step $E(P^{(i)}, S^{(i)}, f)$ produces a new population using four ideas: mutation, crossover, fitness ranking, and partitioning.

Mutation $m(G)\rightarrow G$ randomly perturbs a candidate graph. In NEAT, mutations consist of adding or deleting a node, adding or deleting an edge, or perturbing a node or edge property (such as an edge’s weight or a node’s activation). Each mutation type occurs with a pre-specified probability, and involves a random perturbation; for instance an add-edge randomly chooses an edge location and weights are adjusted with Gaussian noise. One can design other mutations, such as resetting a weight to a new value.

An add-node mutation followed by an add-edge mutation. The add-node mutation splits an existing edge into two edges.

Crossover $c(G_i,G_j)\rightarrow G$ produces a new candidate by swapping properties of two existing candidates. In NEAT, roughly speaking if $G_i$ and $G_j$ have a matching node $v_i, v_j$, then $G$ receives one of them randomly (similarly for edges). $G$ simply inherits non-matching nodes or edges. The notion of ‘matching’ is tricky due to isomorphic graph structures, so NEAT assigns an ID to each new node and edge, then uses these IDs for comparison (see 2.2 and 3.2 of the NEAT paper for details).  In part due to the added complexity, some papers leave out crossover completely.

NEAT crossover mechanism (diagram from the NEAT paper)

Fitness Ranking follows its name, first ranking candidates according to fitness, $(G_{1'},\ldots,G_{N'})$ where $i'>j'$ means $f(G_{i'})>f(G_{j'})$. Only the top (e.g. 20%) candidates are used for crossover and mutation. This locally biases the search towards candidates with high relative fitness.

Partitioning, or speciation, groups candidates according to a distance function $d(G_i,G_j)$. One use of the partitions is to promote diversity in the solution space by modifying each candidate’s fitness. To do so, NEAT defines a distance function and adjusts each candidate’s fitness based on its partition size. Each partition is guaranteed a certain number of candidates in the next generation based on the adjusted fitnesses.

Intuitively, a small partition contains graphs with relatively unique characteristics which might ultimately be useful in a final solution, even if they do not yield immediate fitness. To avoid erasing these characteristics from the search during fitness ranking, the small partition candidates receive guaranteed spots in the next phase.

Novel structures (diamonds, triangles) may ultimately yield a performance gain after further development, despite initially having lower fitness (light green) compared to common, developed structures with high fitness (dark green).

We can write this step as $f_{\text{partition}}(P,f_1,\ldots,f_N,S)\rightarrow (f_1',\ldots,f_N', S')$. We might alternatively view this step as just fitness re-ranking, $f_{\text{re-rank}}(P,f_1,\ldots,f_N)\rightarrow (f_1',\ldots,f_N')$, without requiring actual partitions, though without partitions it may be tricky to achieve the exact ‘guaranteed spots’ behavior.

The partitions $S'$ could also be useful in problems requiring a collection of solutions rather than a single optimal solution. For instance, rather than just selecting the highest performing candidate, we might consider the best candidate in each partition as the final output of NEAT, thus producing a collection of networks, each maximizing fitness in a different way than the others (assuming a partitioning scheme that promotes diverse solutions).

### Example Results

Let’s use the implementation [code link] to solve an xor problem and the Cartpole and Lunar-Lander gym environments.

To solve xor, NEAT finds a network with a single hidden node:

An xor network, including input (green), hidden (blue), and output (red) nodes. Labels show edge weights and node activations.

CartPole-v0 is easy to solve (even random search is sufficient), and NEAT finds a simple network without hidden units (for fun we’ll also construct an artificially complicated solution in the Variations section below):

A CartPole-v0 network.

LunarLander-v2 is more difficult, and NEAT finds a network with non-trivial structure:

A LunarLander-v2 network.

On the xor environment, NEAT creates around 10 partitions, on Cartpole just 1, and on LunarLander it tends to create 2-3 partitions. On these simple environments NEAT also performs similarly without crossover.

Variations As mentioned before, we may want NEAT to produce a diverse set of solutions rather than a single solution. To manually demonstrate this intuition, suppose I want NEAT to find a network that uses sigmoid activations, and one that uses tanh. To do so, I increased the activation parameter in the node distance function (the $d(\cdot,\cdot)$ used in partitioning), then chose the highest scoring network from each partition. On Cartpole, the partitions now naturally separate into sigmoid and tanh networks:

While Cartpole is evidently simple enough for a network with no hidden layers, perhaps we want to follow a trend of using large networks even for easy problems. We can modify the fitness function to ‘reject’ networks without a certain number of connections, and NEAT will yield more complicated solutions:

A more complicated way to play Cartpole.

In particular, I added -1000 to the fitness when the network had less than k connections, starting with k=5 and incrementing k each time a candidate achieved max fitness at the current (stopping at k=20).

## Discussion & Extensions

Vanilla NEAT attempts to find both a network structure and the corresponding weights from scratch. This approach is very flexible and involves minimal assumptions, but could limit NEAT to problems requiring small networks. However, the key idea can still be applied or modified in creative ways.

### Minimal Assumptions

NEAT represents an extreme on the spectrum of learned versus hand-crafted architectural biases, by placing few assumptions on graph structure or learning algorithm. At a very speculative level, such flexibility may be useful for networks with backward or long-range connections that may be difficult to hand design, or as part of a learning process which involves removing or adding connections rather than optimizing weights of a fixed architecture.

A more concrete example is the recent Weight Agnostic Neural Networks paper (Gaier & Ha 2019), where the authors aimed to find a model for a task by finding good network structures, rather than finding good weights for a fixed network structure; they use a single shared weight value in each network and evaluate fitness on multiple rollouts, with a randomly selected weight value for each rollout. In this case, a NEAT variant allowed finding exotic network structures from scratch, without requiring prior knowledge such as hand-designed layer types.

As a rough approximation, I modified the NEAT implementation so that each network only has a single shared weight value, and included more activation functions (sin, cos, arctan, abs, floor). Each run of evaluation sets the network’s shared weight to a randomly sampled value ($\mathcal{U}([-2,2])$ excluding $[-0.1,0.1]$), and the network’s overall fitness is the average fitness over 10 runs. On XOR, NEAT finds a network with similar structure as before:

XOR network with a random shared weight value

This was just an initial experiment to give intuition, so check out the WANN paper for a good way of doing this for non-trivial tasks.

### Scalability

One could also consider improving NEAT’s scalability. A high level strategy is to reduce the search space by restricting the search to topologies, searching at a higher abstraction level, or introducing hierarchy.

An example is DeepNEAT (Miikkulainen et al 2017), which evolves graph structures using NEAT, but with nodes representing layers rather than single neurons, and edges specifying layer connectivity. Weight and bias values are learned with back-propagation. The authors further extend DeepNEAT to CoDeepNEAT, which represents graphs with a two level hierarchy defined by a blueprint specifying connectivity of modules. Separate blueprint and module populations are evolved, with the full graph (module + blueprint) assembled for fitness evaluation.

Blueprint and Module populations. Each node in a Blueprint (hexagon) is a Module.

This view is quite general, allowing learning the internal structure of reusable modules as well as how they are composed. In the experiments the authors begin with modules involving known components such as convolutional layers or LSTM cells and evolve only specific parts (e.g. connections between LSTM layers), but one might imagine searching for completely novel, reusable modules.

### Indirect Encodings

NEAT essentially writes down a description, or direct encoding, of every node and edge and their properties, then evolves these descriptions. The description size grows as the network grows, making the search space prohibitively large.

An alternative is to use a function to describe a network. For instance, we can evaluate a function $f:\mathbb{R}^4\rightarrow \mathbb{R}$ at pairs of points from a $V\times V$ grid to obtain a weighted adjacency matrix. This function is an example of an indirect encoding of the graph. Assuming the description of $f$ is small, we can describe very large networks by evaluating a suitable $f$ using a large grid or coordinate pattern. A neural network with a variety of activations that is evaluated in this manner is called a compositional pattern producing network (CPPN) [see, also].

HyperNEAT (Stanley et al. 2009) uses this idea to find network weights by evolving an indirect encoding function. HyperNEAT uses NEAT to evolve a (small) CPPN to act as $f$, then evaluates $f$ at coordinates from a hyper-cube, resulting in weights of a (larger) network used for fitness evaluation.

Several works have adopted or extended ideas from HyperNEAT for a deep learning setting. Fernando et al. 2016 proposed the Differentiable Pattern Producing Network (DPPN) which evolves the structure of a weight-generating CPPN $f$ while using back-propagation for its weights. The authors evolve a 200 parameter $f$ that generates weights for a fully connected auto-encoder with ~150,000 weights, though it is for a small-scale MNIST image de-noising task. Interestingly the weight generating function $f$ learns to produce convolution-esque filters embedded in the fully connected network.

From [Fernando et al 2016]

HyperNetworks (Ha et al 2016) further scales HyperNEAT’s notion of indirect encodings to more complex tasks by learning a weight generation function with end-to-end training, including an extension that can generate time-varying weights for recurrent networks:

From [Ha et al. 2016]

### Wrapping Up

In this post we revisited a core technique for generating neural network topologies, and briefly traced some of its outgoing research paths. We took a brief step back from the constraints of pre-defined-layer architectures and searched through a space of very general (albeit small-scale) topologies. It was interesting to see how this generality has been refined towards some larger scale tasks, but also revisited.  We briefly saw how fitness re-ranking and partitioning can be used to yield a set of distinct solutions, which connects to other concepts that I may discuss further in future posts.

[Code]

# ECIR 2016 Paper and Presentation

I recently presented a paper entitled Efficient AUC Optimization for Information Ranking Applications at the European Conference for Information Retrieval (ECIR) in Padua, Italy.

In the paper, we derive gradient approximations for optimizing area under an ROC curve (AUC) and multi-class AUC using the LambdaMART algorithm. Here are slides for the talk, which give an overview of the paper and a brief review of Learning to Rank and LambdaMART:

The goal was to expand LambdaMART to optimize AUC in order to add to the portfolio of metrics that the algorithm is currently used for. The approach was to derive a “λ-gradient” analogous to those that have been defined for other metrics such as NDCG and Mean Average Precision.

This in turn required an efficient way of computing a key quantity, namely the change in the AUC from swapping two items in the ranked list. One contribution of the paper is a simple, efficient formula for computing this quantity (along with a proof), which appears in the paper as:

The paper also contains experimental results measuring the performance of LambdaMART using the derived gradients.

# Peering into the Black Box : Visualizing LambdaMART

In the last post, I gave a broad overview of the Learning to Rank domain of machine learning that has applications in web search, machine translation, and question-answering systems. In this post, we’ll look at a state of the art model used in Learning to Rank called LambdaMART. We’ll take a look at some math underlying LambdaMART, then focus on developing ways to visualize the model.

LambdaMART produces a tree ensemble model, a class of models traditionally viewed as ‘black boxes’ since they take into account predictions of 10’s or 100’s of underlying trees. While viewing a single decision tree is intuitive and interpretable, viewing 100 at once would be overwhelming and uninformative:

Single Tree

Ensemble of Trees

By moving from a single tree to an ensemble of trees, we tradeoff interpretability for performance.

In this post, we’ll take a look at the trees produced by the LambdaMART training process, and develop a visualization to view the ensemble of trees as a single collective unit. By doing so, we’ll gain back some of the intuitive interpretability that make tree models appealing. Through the visualization, we’ll peer into the black box model and gain some sense of the factors that the model uses to make predictions.

We use Java and the JForests library to train LambdaMART and parse its output, and d3.js for visualization. To get a sneak peek, the final visualization is here.

LambdaMART Overview

Let’s take a look at some details of LambdaMART. For the full story, check out this paper from Microsoft Research.

At a high level, LambdaMART is an algorithm that uses gradient boosting to directly optimize Learning to Rank specific cost functions such as NDCG. To understand LambdaMART we’ll look at two aspects: Lambda and MART.

MART

LambdaMART is a specific instance of Gradient Boosted Regression Trees, also referred to as Multiple Additive Regression Trees (MART). Gradient Boosting is a technique for forming a model that is a weighted combination of an ensemble of “weak learners”. In our case, each “weak learner” is a decision tree.

Our goal is to find a function $f(x)$ that minimizes the expected loss $L$:

$\hat{f}(x) = \arg\min_{f(x)}E[L(y, f(x))|x]$

for feature vectors $x$ and labels $y$.

To do so we first view $\hat{f}(x)$ as a sum of weak learners $f_m$:

$\hat{f}(x)=\sum_{m=1}^M f_{m}(x)$

Since we are dealing with decision trees, evaluating $f_{m}(x)$ corresponds to passing $x$ down the tree $f_{m}$ until it reaches a leaf node $l$. The predicted value is then equal to a parameter $\gamma_l$.

Decision trees consist of two types of parameters: region assignments $R_i$ that assign a training example $x_i$ to a leaf node $l$, and leaf outputs $\gamma_l$ that represent the tree’s output for all examples assigned to region $R_l$. Hence we can write each weak learner $f_{m}(x)$ as a parameterized tree $h_{m}(x;R_{m}, \gamma_{m})$, giving:

$\hat{f}(x)=\sum_{m=1}^M f_{m}(x)=\sum_{m=1}^M h_{m}(x; R_{m}, \gamma_{m})$

Gradient Boosting finds each tree in a stepwise fashion. We start with an initial tree $f_1$, then step to another tree $f_2$, and so on:

$\hat{f}_{1}(x) = f_{1}$

$\hat{f}_{2}(x) = f_{1} + f_{2}$

$\hat{f}_{M}(x) = \sum_{m=1}^{M}f_{m}$

How do we determine the steps? At each step, we’d like the model to change such that the loss decreases as much as possible. The locally optimal decrease corresponds to the gradient of the loss with respect to the current model $m$:

$g_{im}=\frac{\partial L[f(x_i), y_i]}{\partial f(x_i)}$

Hence we define the step as $f_m=-\rho_mg_m$. This gives us insight into how Gradient Boosting solves the minimization – the algorithm is performing Gradient Descent in function space.

Now, we know that we want to take a gradient step, but still haven’t said how that translates to finding a tree. To do so, we build a tree that models the gradient; by adding such a tree to the ensemble, we effectively take a gradient step from the previous model. To fit the tree, we use squared error loss, giving the minimization problem:

$argmin_{R, \gamma}\sum_{i=1}^N(-g_{im}-F(x_i;R, \gamma))^2$

Where $F(x_i;R, \gamma)$ denotes a regression tree with parameters $R, \gamma$.

Let’s take a step back. We’ve just set up a framework for finding an ensemble of trees that, when added together, minimizes a loss function. It’s a “framework” in the sense that we can use it if we merely supply gradients of the loss function at each training point. This is where the “Lambda” part of LambdaMART comes into play.

For further reading, Chapter 10 of Elements of Statistical Learning provides a great and thorough overview. also provides a good intro of Gradient Boosting.

Lambda

In ranking, the loss function that we’ll most likely care about optimizing is probably either NDCG, MAP, or MRR. Unfortunately, these loss functions aren’t differentiable at all points, so we can’t use them directly in our Gradient Boosting “framework”, since it’s unclear how we can provide the gradients at each training point.

To address this, LambdaMART uses an idea from a model called LambdaRank. For each training point pair $i,j$, we compute a value $\lambda_{i,j}$ that acts as the gradient we need. Intuitively, $\lambda_{i,j}$ can be thought of as a force that moves documents up and down the ranked list. For instance, if document $i$ is ranked lower than document $j$, then document $i$ will receive a push of size $|\lambda_{i,j}|$ downwards in the ranked list, and $j$ will be pushed upwards the same amount. The LambdaMART paper reports that empirically, the $\lambda$‘s have been used successfully to optimize NDCG, MAP, and MRR.

Hence in LambdaMART we use the Gradient Boosting framework with the $\lambda_i$‘s acting as the gradients $g_i$. This leads to update rules for the leaf values $\gamma_l$ that depend on the $\lambda$ values of the training instances that fall in leaf $l$.

We’ve observed that LambdaMART trains an ensemble of trees sequentially. What does this sequence of trees look like? Let’s look at an ensemble of 19 trees trained with LambdaMART on the OHSUMED dataset:

At each inner node, the small number denotes the feature label, and the larger number denotes the threshold. The number at each leaf node denotes the leaf output.

We can see that each tree is somewhat similar to the previous tree. This makes sense given that each tree $t_i$ is dependent on the model state $F_{i-1}$ since it is fit to the gradient with respect to that model’s gradient.

The step-by-step animation gives us insight into what is happening during training. It’s a great way to visualize and gain an intuition of the gradient boosting process – we see consecutive trees being produced that depend on the previous tree. However, we are still viewing the LambdaMART model as a group of separate trees; we still need to develop a way of viewing the final model cohesively.

Does the sequence animation tell us anything that could help with developing a single, cohesive model view? We can start with the observation that often, the trees look “roughly similar”.

Sometimes two trees will have the same feature at a given node position; for instance the root favors features 7, 9, and 11. Trees may not even change from step-to-step or may share a similar overall structure, such as trees #3 and #4 – notice how the numbers change, but the structure remains the same. Finally, all of the trees only draw from a subset of the total feature set; for instance feature 4 is never used by any of the trees.

Grouping Similar Nodes

Given these observations, we can think about how to build a unified visualization that takes advantage of reused features and similar structure between trees. Let’s consider the root node. We can count the frequency of each feature that occurs as the root node across the entire ensemble, leading to a single node that shows that feature 7 was used as the feature for the root node for 3 trees, feature 9 for 5 trees, and so on:

At each tree level h, there are $2^h$ possible node positions. We can collect the feature frequencies for each of these positions across the ensemble. If a position doesn’t appear in a tree, we mark it as ‘DNE’, and if a position is a leaf, we mark it as ‘Leaf’, such as position (3, 6):

Heatmap Tree

At each node, we have a list of feature (or DNE or Leaf) counts. Heatmaps provide a natural way of visualizing this count information. We can make each node of the ordered-pair-tree into a heatmap, giving rise to the “Heatmap Tree”:

The Heatmap Tree

To view the interactive d3.js visualization, see this link.

The Heatmap Tree shows the entire ensemble as a single unit. Each number is a feature, and the color scale tells us how many trees use that feature at the given node:

By looking at the Heatmap Tree we can get a sense of which features the tree uses when it classifies instances. The ensemble uses features 7, 9, 11, 25, and 29 to perform the initial split, with 11 being used the most often. Further down the tree, we see features 7 and 9 again, along with common features such as 33 and 37. We can easily see that most of the trees in the ensemble are between 5 and 7 levels deep, and that while there are 46 total features, at a given node location the most variation we see is 9 features.

The Heatmap Tree can bring together hundreds of trees, such as this visualization of a 325 tree ensemble:

Tuning Parameters

The Heatmap Tree also lets us see how tuning parameters affect the final model. For instance, as we decrease the learning rate from 0.10 to 0.001, we see the ensemble size fluctuate:

Learning Rate 0.10

Learning Rate 0.01

Learning Rate 0.001

Notice how in the 0.01 case, the Heatmap Tree concisely summarized a 111-tree ensemble.

When we use feature subsampling, we see the number of features at each node increase (in general). Each tree has a different limited subset of features to choose from, leading to a spread in the overall distribution of chosen features:

Feature subsampling 0.6

Feature subsampling and training data subsampling makes this ensemble more crowded:

Feature subsampling 0.6, Data subsampling 0.6

Note that these parameter trends do not necessarily generalize. However, the Heatmap Tree captures all of the trees and features in a single structure, and gives us insight into the structural results of our parameter choices.

Limitations

The Heatmap Tree, unfortunately, has its limits. With wide variation of features and many leaves, the tree becomes crowded:

Since the number of possible node locations at a given level increases exponentially with height, the tree also suffers when trying to visualize deep trees.

Expansions

Another nice aspect of decision trees is that we can visualize how a test instance gets classified; we simply show the path it takes from root to leaf.

How could we visualize the classification process in an ensemble, via the Heatmap Tree or otherwise? With the Heatmap Tree, we would need to be able to simultaneously visualize 10’s or 100’s of paths, since there would be an individual path for every tree in the ensemble. One idea is to have weighted edges on the Heatmap Tree; an edge would become thicker each time the edge is used when classifying an instance.

Another next step is to test the generalizability of this visualization; could it work for any gradient boosting model? What would a Heatmap Tree of a random forest look like?

Conclusion

We’ve taken a close look at LambdaMART and gradient boosting. We’ve devised a way to capture the complexity of gradient boosted tree ensembles in a cohesive way. In doing so we bought back some of the interpretability that we lost by moving from a single tree to an ensemble, gained insight into the training process, and made the black box LambdaMART model a bit more transparent.

To see an interactive d3 Heatmap Tree, visit this link.

From RankNet to LambdaRank to LambdaMART: An Overview

“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis

# Learning to Rank Overview

How does a search engine rank the results that it returns?

How does a translation system ultimately choose a translation for some text?

In the next few posts, we will gain insight into these questions by exploring a sub-domain of machine learning called Learning to Rank. Each of the three questions has a common last step, where an algorithm takes a list of ‘candidates’ and produces a ranked list of results.

In the translation case, we give a system of string of text to translate, and it produces, say, 20 potential candidate translations. Then a Learning to Rank model steps in, ranks those 20 candidates, and returns the top candidate.

In the question-answering case, we give a system a query string, and it produces, say, 100 potential answers. Again we use a Learning to Rank model to produce a ranking of the potential answers, and select the top one as the final answer.

In the search engine case, we type in a search query, and the search engine produces, say, 1000 candidate search results. The order of the final results that you eventually see are produced by a Learning to Rank model that ranks those 1000 candidate results.

To explore the details, we’ll first define the problem addressed in the Learning to Rank domain and survey the methods, datasets, and metrics that are used. Then in the next posts, we’ll look in depth at a state of the art model called LambdaMART, and devise a way of visualizing its training process and the resulting model. Along the way we’ll also develop wrapper code for running a major learning to rank library called JForests and discuss other problems tackled by the field.

# Learning to Rank

As a first step, we’ll define the problem. While learning to rank has numerous applications including machine translation, QA, and even spell checking, we’ll use search engine document ranking as a running example, mainly due to availability of datasets.

## Problem definition

In the context of document retrieval, consider a search engine that accepts a query, and produces a ranking of documents related to the query using a technique such as TF-IDF scoring. Suppose we take the top 100 documents. The goal of learning to rank is to produce a better ranking of these 100 candidate documents for the given query.

Most research has posed learning to rank as a supervised learning problem, although there has been work in semi-supervised ranking. We’ll only focus on the supervised problem here. Given a list of queries and candidate documents with annotated relevance scores, the learning to rank model learns to predict the relevance of candidate documents for a new query, and derives a ranking from the predictions. Let’s define that a bit more precisely.

## More precise problem definition

The data consists of queries, documents, and relevance scores. Let $Q$ denote the set of queries, $D$ denote the set of documents, and $Y$ denote the set of relevance scores.

For each query $q_{i} \in Q$ we have a list of $m$ candidate documents $D_{i}={d_{i,1},d_{i,2},...,d_{i,m}}$. We want to produce a ranking for the documents $D_{i}$, specifically a permutation $\pi_{i}$ of the document indices $\{1,2,...,m\}$. To do so, we want to learn a function $f(q,D)$ that produces an optimal permutation $\pi_{i}$.

As we’ll see, this ranking is produced in different ways; for instance, some models derive the rankings by producing a score for each $(q_{i},d_{i,j})$ pair, while others learn to tell whether the rank of $(q_{i},d_{i,j})$ is higher than $(q_{i},d_{i,k})$.

## Training Set

In the standard training set, each document $d_{i,j}$ is assigned a relevance score $y_{i,j}$ that says how relevant the document is to query $q_{i}$. These scores can be assigned by human annotators, or derived implicitly from sources such as click through data.

We can think of the training set as consisting of 3-tuples $(q_{i}, d_{i,j}, y_{i,j})$; one training example consists of a query, a candidate document, and a measure of how relevant that document is to the query. An example labeling is $y_{i,j} \in \{1, 2, 3, 4, 5\}$, where 1 means “not relevant” and 5 means “very relevant”.

For instance, we could have a query $q_{1}$ = “Leo Tolstoy famous novel”, with candidate documents $D_{1}=\{d_{1,1}=Steppenwolf, d_{1,2}=War\ and\ Peace, d_{1,3}=The\ Cossacks\}$, and relevance labels $y_{1,1}=1,y_{1,2}=5,y_{1,3}=3$.

Each pair $(q_{i}, d_{i, j})$ is actually a feature vector $\phi(q_{i}, d_{i, j})$, so the final training set is defined as $T=\{(\phi(q_{i}, d_{i, j}),y_{i,j})\}$ for $i=1...n,\ j=1...m$, where $n$ is the number of queries, and $m$ is the number of candidate documents for each query.

## A concrete training set

To make this concrete, here’s a few training examples taken from the OHSUMED dataset:

0 qid:1 1:1.000000 2:1.000000 3:0.833333 4:0.871264 5:0 6:0 7:0 8:0.941842 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:1.000000 16:1.000000 17:1.000000 18:0.719697 19:0.729351 20:0 21:0 22:0 23:0.811565 24:1.000000 25:0.972730 26:1.000000 27:1.000000 28:0.922374 29:0.946654 30:0.938888 31:1.000000 32:1.000000 33:0.711276 34:0.722202 35:0 36:0 37:0 38:0.798002 39:1.000000 40:1.000000 41:1.000000 42:1.000000 43:0.959134 44:0.963919 45:0.971425 #docid = 244338
2 qid:1 1:0.600000 2:0.600000 3:1.000000 4:1.000000 5:0 6:0 7:0 8:1.000000 9:0.624834 10:0.767301 11:0.816099 12:0.934805 13:0.649685 14:0.680222 15:0.686762 16:0.421053 17:0.680904 18:1.000000 19:1.000000 20:0 21:0 22:0 23:1.000000 24:0.401391 25:0.938966 26:0.949446 27:0.984769 28:0.955266 29:1.000000 30:0.997786 31:0.441860 32:0.687033 33:1.000000 34:1.000000 35:0 36:0 37:0 38:1.000000 39:0.425450 40:0.975968 41:0.928785 42:0.978524 43:0.979553 44:1.000000 45:1.000000 #docid = 143821
0 qid:1 1:0.400000 2:0.400000 3:0.555555 4:0.563658 5:0 6:0 7:0 8:0.545844 9:0.380576 10:0.427356 11:0.468244 12:0.756579 13:0.366316 14:0.360838 15:0.373909 16:0.210526 17:0.479859 18:0.595237 19:0.608701 20:0 21:0 22:0 23:0.613865 24:0.184562 25:0.791539 26:0.863833 27:0.957024 28:0.896468 29:0.941132 30:0.946305 31:0.232558 32:0.507810 33:0.603068 34:0.616847 35:0 36:0 37:0 38:0.614004 39:0.202374 40:0.812801 41:0.868091 42:0.958879 43:0.926045 44:0.944576 45:0.963753 #docid = 285257

...

0 qid:63 1:0.142857 2:0.162076 3:0.250000 4:0.250000 5:0 6:0 7:0 8:0.167856 9:0.078385 10:0.103707 11:0.132891 12:0.419191 13:0.027399 14:0.027300 15:0.027300 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.026316 32:0.063236 33:0.444444 34:0.407079 35:0 36:0 37:0 38:0.189973 39:0.011360 40:0.068814 41:0.066431 42:0.287984 43:0.188175 44:0.000000 45:0.046177 #docid = 173315
2 qid:63 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0 6:0 7:0 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0 36:0 37:0 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.000000 44:0.000000 45:0.000000 #docid = 9897



We can see that the relevance scores $y_{i,j}$ (the first number of each line) are between 0 and 2. Looking at the original OHSUMED text, we see that the first example is $(y_{1,1},\phi(q_{1}, d_{1, 1}))$ with:

$q_{1}$: “Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy”

$d_{1,1}$= #docid 244338 = “Effects on bone of surgical menopause and estrogen therapy with or without progesterone replacement in cynomolgus monkeys.”

$y_{1,1}$ = 0 = not relevant

and the second example is $(y_{1,2},\phi(q_{1}, d_{1, 2}))$ with:

$d_{1,2}$= #docid 143821 = “Cardiovascular benefits of estrogen replacement therapy.”

$y_{1,1}$ = 2 = very relevant

The text has been converted into the features described on pg.11 of this paper. Note that I only showed the document title here and omitted the body text.

## Test Instance

At test time, we receive a query $q$ and a list of $m$ candidate documents $D={d_{1},d_{2},...,d_{m}}$. We evaluate $f(q,D)$ and return a ranking of the documents $D$, specifically a permutation $\pi$ of the document indices $\{1,2,...,m\}$.

# Learning to Rank Datasets

There are several benchmark datasets for Learning to Rank that can be used to evaluate models.

Microsoft Research released the LETOR 3.0 and LETOR 4.0 datasets. LETOR 3.0 contains data from a 2002 crawl of .gov web pages and associated queries, as well as medical search queries and medical journal documents from the OHSUMED dataset. LETOR 4.0 contains queries from the Million Query Track from TREC and document ids from Gov2, another crawl of .gov web pages. Data is represented as raw feature vectors; information about the features used and the datasets is found here. One thing that I like about the OHSUMED data is that the original text of the queries and documents is available, which can be helpful for interpreting model output or deriving new features. Unfortunately obtaining the original GOV data is nontrivial.

Microsoft Research also released a larger scale dataset consisting 30,000 queries, candidate features, and relevance judgements derived from the Bing search engine. The MSLR-WEB30K dataset contains all of the queries, while a smaller MSLR-WEB10K dataset contains a 10,000 query sample of MSLR-WEB30K. Unfortunately, we are unable to retrieve the original query text and URLs to further interpret results, but the dataset is nevertheless very useful for evaluating a model and producing metrics.

Yahoo sponsored a Learning to Rank competition in 2010 and released a dataset consisting of 36,000 queries and 883,000 document IDs. The datasets are available here, but require a university email to download. Similar to MSLR-WEB, we receive the raw features and relevance scores, and are unable to see the text of the query or URL.

# Learning to Rank Evaluation Metrics

When we run a learning to rank model on a test set to predict rankings, we evaluate the performance using metrics that compare the predicted rankings to the annotated gold-standard labels. Before reviewing the popular learning to rank metrics, let’s introduce notation.

We’ll assume that there are n queries, and that for each query $q_{i}$ we have a list of m candidates $D_{i} =\{d_{i,1},...d_{i,m}\}$ with annotated relevance scores $y_{i,1},...y_{i,m}$. The ranker produces a ranking $\pi_i$, so that $\pi_{i,j}$ represents the predicted ranking of document $d_{i,j}$.

Intuitively, we would like our ranking to have high relevance scores, with the highest scores receiving the highest rankings. Now we’ll look at several metrics that capture these intuitions.

## Normalized Discounted Cumulative Gain (NDCG)

Normalized Discounted Cumulative Gain (NDCG) is a popular learning to rank metric that relies on Cumulative Gain and Discounted Cumulative Gain.

Cumulative Gain

First we’ll look at Cumulative Gain. Suppose we have:

$\pi_i = \{d_{i,1}, d_{i,2}, d_{i,3}, d_{i,4}\}$

$y_{i}= \{5, 2, 5, 0\}$

That is, the top ranked document has a relevance score of 5, the second ranked document has a relevance score of 2, and so on.

The cumulative gain is simply the sum of the relevance scores:

$\sum_{i}y_{i}=5+2+5+0=12$.

Notice that this doesn’t take ranking order into account; if our ranking was reversed:

$\pi_i = \{d_{i,4}, d_{i,3}, d_{i,2}, d_{i,1}\}$

it would receive the same cumulative gain score:

$\sum_{i}y_{i}=0+5+2+5=12$.

Discounted Cumulative Gain

Discounted Cumulative Gain (DCG) takes order into account by ‘rewarding’ high relevance scores that appear in high ranks. A high relevance score appearing at a low rank receives a lower ‘reward’. To compute the metric, we use:

$DCG_k = \sum_{j=1}^{k} \frac{2^{y_{i,\pi_{i,j}}}-1}{log(i+1)}=31+1.9+15.5+0=48.4$

We can see that in position 1, the high relevance score 5 is worth 31, while in position 3 it’s worth 15.5. The lower relevance score 2 is only worth 1.9 in position 2. With DCG order matters; the score would increase if we swapped the rankings of $d_{i,2}$ and $d_{i,3}$.

Normalized Discounted Cumulative Gain

$NDCG_k$ is just a normalized version of $DCG_k$. The normalization is done by finding the $DCG_k$ of an ‘ideal’ ranking. In our example, the ideal ranking would be:

$\pi_i = \{d_{i,1}, d_{i,3}, d_{i,2}, d_{i,4}\}$

giving:

$DCG_{k_{ideal}}=\sum_{j=1}^{k} \frac{2^{y_{i,\pi_{i,j}}}-1}{log(i+1)}=31+19.5+1.5+0=52$

so:

$NDCG_k= \frac{DCG_k}{DCG_{k_{ideal}}}=\frac{48.4}{52}=0.93$

We can see that if our predicted rankings were ideal, they would receive an $NDCG_k$ score of $1$.

## Mean Average Precision (MAP)

Mean average precision (MAP) is another popular metric for learning to rank, specifically for applications with binary relevance scores, e.g. “correct” and “incorrect” in question-answering.

P@k

MAP relies on Precision at k (P@k), which involves counting the number of relevant (or “correct”) documents in the first k ranked positions, and dividing by k:

$P@k=\frac{1}{k}\sum_{i=1}^{k}I(relevance_i==1)$

AP

Average precision is then defined as:

$AP = \frac{\sum_{k=1}^{n}(P@k \times I(relevance_k==1))}{\sum_{k=1}^{n}I(relevance_k==1)}$

MAP

Notice that $AP$ is computed for a single query.  $MAP$ is the mean $AP$ value over all queries:

$MAP = \frac{1}{numQueries}\sum_{q=1}^{numQueries}AP_{q}$.

# Learning to Rank Models

Learning to Rank models can be broadly classified into three groups: scorewise, pairwise, and listwise. We’ll introduce and provide an example of each group. There are many more learning to rank models than the ones covered here.

## Pointwise Models

Pointwise models, also known as ‘scorewise’ models, create rankings by computing a real-valued score for each example $\phi(q_{i},d_{i,j})$, then sort by the score.

An advantage of pointwise models is that they have an interpretable confidence score for each query-document pair. Possible disadvantages are that there is not a notion of grouping the training instances (e.g. by query), and that the prediction target is exact relevance rather than relative position.

Logistic regression, PRank, and MCRank are examples of pointwise models.

## Pairwise Models

Pairwise models learn to decide which of two documents is ranked higher. The models rely on the observation that in learning to rank problems we ultimately care about the relative position of documents rather than exact distances between document scores.

Pairwise models first isolate the documents for each query, then form a training set consisting of pairs of documents, and a label for whether the first document has a higher rank than the second document. So each training example is of the form $(\phi(q_{i},d_{i,j}),\phi(q_{i},d_{i,k}),\{+1,-1\})$. Model-wise, we find $f$ such that $f(d_{i,j})>f(d_{i,k})$ when $y_{i,j}>y_{i,k}$.

The problem is transformed into a binary classification problem; we train a binary classifier to minimize the number of incorrectly ordered pairs.

Several pairwise models exist, each taking different approaches to the problem. Examples include RankNet, SVMRank, and AdaRank. Another model called LambdaMART will be the focus of the next post.

## Listwise Models

Listwise models are trained by optimizing a loss function that measures predicted ranks versus actual ranks for a query’s entire candidate list. A loss function such as MAP or NDCG can be directly optimized by listwise models. Examples include coordinate ascent and LambdaRank.

## Up Next

We’ve introduced the problem of Learning to Rank, and briefly surveyed its algorithms. In the next post, we’ll examine the LambdaMART model, and develop a way of visualizing its training process.

A Short Introduction to Learning to Rank

Learning to Rank QA Data

“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis

Learning to Rank Wikipedia

# Scala Coursera Highlights

I recently completed the Functional Programming Principles in Scala course on Coursera. Along the way, I learned some interesting things ranging from small tidbits to major language features. I’m going to use this post as a chance to highlight and review my favorite things that I learned about Scala from the course. This is meant to be more of a quick review than a set of tutorials, so I’ll provide links to actual tutorials and further reading for each section.

Variance

One thing I liked was seeing abstract programming language concepts concretely being used in Scala. An example of this is the idea of variance, namely covariance and contravariance. Covariance and contravariance appear when investigating whether one type is a subtype of another type. This question comes up concretely in several places, such as deciding whether a type is a valid argument parameter, or deciding which types can be held by a data structure.

Argument Parameters

Functions accept subtypes as arguments. For instance, a function requiring an Animal type will accept Cat or Dog subtypes as arguments. An interesting (and tricky) case is when a function takes a function as an argument, e.g.

def fun[T] (animal: Animal, paramFun: Animal => T) : T

What are valid arguments for fun? The animal argument is clear; it’s just any subtype of Animal. But which functions could we pass in for paramFun? This requires determining the subtype of a function.

A natural guess may be that a function $f1: A => B$ is a subtype of a function $f2: C => D$ when A is a subtype of C and B is a subtype of D. However, this is incorrect!

It turns out that we need to ‘reverse’ the argument subtyping; C must be a subtype of A. Specifically, we have:

$f1: A => B$ is a subtype of $f2: C => D$ when:

• C is a subtype of A
• B is a subtype of D

For instance, the function

def f1(animal: Animal) : Cat

is a subtype of

def f2(cat: Cat) : Animal

In terms of variances, we describe the above by saying that functions are contravariant in their argument types, and covariant in their return types.

The intuition for this is derived from the Liskov Substitution Principle, which is (paraphrased) the idea that a subtype should be expected to do everything that its supertypes can do. We should be able to use f1 in all of the ways that we can use f2.

To connect it back with this post’s topic, it turns out that defining and enforcing variance rules is a language feature in Scala! Scala provides the - and + type annotations for contravariance and covariance, respectively. Thus we can implement, for instance, a one parameter function class, and bake the variance rules into the type parameters:

class Function1[-V, +W] { ... }

The type tells us that Function1 is contravariant in V and covariant in W, as desired.

Type Bounds

Another type-related feature is type bounding. We can create a parameterized function and specify that its parameterized type is a subtype (or supertype) of another type, e.g:

def foo[T :> String] (t: T)

says that T must be a supertype of String, and

def foo[String <: T <: Object] (t: T)

says that T must be a supertype of String and a subtype of Object.

An Example

As an example of where we would use variance and type bounds, suppose we want to add a prepend function to the List class. Looking at Scala’s documentation, we can see that the List are parameterized by a covariant type A:

sealed abstract class List[+A] ...

Now, “prepend“ will add an element to the front of the list; a first guess at its type signature might be:

// compile error
def prepend[A] (a: A) : List[A]

We are attempting to pass a covariant type as a function argument; since function arguments are contravariant this will result in a compile error.

To fix this, we can use a type bound to tell the compiler that we’d like to be able to pass in any supertype B of A, and produce a new list of B’s:

// compiles
def prepend[B >: A] (b: B) : List[B]

For Comprehensions

The functions map, flatMap, and filter are used commonly in functional programming. A common pattern is to chain these functions together, which can lead to bloated and unreadable (albeit effective) code. A simple example:

(1 until 10).flatMap(x =>
(1 until 20).filter(y => y >= 10)
.map(y => (x,y)))

Scala provides for comprehensions as a way of chaining map, flatMap, and filter operations together. Broadly stated, for comprehensions are syntactic sugar that map ‘arrow’, ‘if’, and ‘yield’ to map, filterWith, and flatMap, leading to more concise and readable code:

for {
x <- (1 until 10)
y <- (1 until 20)
if y >= 10
} yield (x, y)

Any type that supports map,filter, and flatMap is eligible to be used in a for comprehension, such as List, Future, Map, and many many more. For those familiar with Haskell, Scala’s for comprehensions are analogous to Haskell’s do notation, which eliminates the clutter of explicitly writing binds.

Call by Optional

By default, Scala functions are call by value, meaning that function parameters will be reduced prior to evaluating the function body. However, Scala let’s you override this default by adding => before a parameter’s type:

// x is call by value
// y is call by value
def fun1(x: Int, y: Int)

// x is call by value
// y is call by name
def fun2(x: Int, y: => Int)

The y parameter in fun2 will be call by name, meaning that it will only be evaluated if and when it’s used in the function body.

While this is a relatively minor feature, it speaks to a general theme that I’ve noticed with Scala: the language provides you with many tools, opening up multiple ways to solve a problem. While this flexibility can be misused or confusing, so far I’ve found that it contributes to Scala being a great practical language. Code can be structured using functional ideas, while still incorporating object-oriented ideas and allowing access to the extensive ecosystem of Java libraries.

Here are some links with more details on these areas:

Variance and Type Bounds:

For Comprehensions:

There are also some good general Scala resources that the course pointed to:

# LDAOverflow with Online LDA

In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets, and visualized them. In this post, we’ll investigate using LDA on an 8gb dataset of around 8 million Stack Overflow posts.

We’ll need to take a different approach; for the tweets we used a batch algorithm that worked well for the relatively small dataset of around 5000 tweets, but would likely introduce performance issues when running on massive datasets. The batch algorithm also assumed that we have the entire training set at the start of training, making the approach unviable for streaming data, which we may receive during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training if we receive more data without have to retrain on the original chunk.

In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.

First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.

Variations on Variational Bayes

For brevity this part will assume that you’ve read through the math background in the first LDA post. I’ll also only highlight major parts; for the full story check out Hoffman’s online LDA paper.

In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.

In short, we approximate the true distribution by a simple distribution $q$, and associate parameters $\phi,\ \gamma,\ \lambda$ with the original parameters $z,\ \theta,\ \beta$ respectively. Recall that $z$ gives the topic assignments for each word in each document , $\theta$ gives the topic composition of each document, and $\beta$ gives the word-topic probabilities for each word and each topic.

Specifically, we have:

$q(z_{di}=k) = \phi_{dw_{di}k}$

$q(\theta_{d}) = dirichlet(\theta_{d}; \gamma_{d})$

$q(\beta_{k}) = dirichlet(\beta_{k}; \lambda_{k})$

Our goal is to estimate $\phi,\ \gamma,\ \lambda$. In both batch and online LDA, we alternate between two steps:

1. E-Step: Estimate $\phi,\ \gamma$ using the current value of $\lambda$

2. M-Step: Update $\lambda$ , using the current value of $\phi$

The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.

Starting with Batch

In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:

E-Step
for d = 1 to numDocs
initialize $\gamma$
repeat until change in $\phi < \epsilon$
update $\phi$
update $\gamma$

Then the M-Step updates $\lambda$ using $\phi$ values from every document:

M-Step
update $\lambda$

$\phi_{d,word_{i},k} \propto e^{E_{q}(log \theta_{d,k})+E_{q}(log\beta_{k,word_{i}})}$

$\gamma_{d,k} = \alpha + \sum_{word_{i}}\phi_{d,word_{i},k}n_{d,word_{i}}$

$\lambda_{k,word_{i}}=\eta +\sum_{d}n_{d,word_{i}}\phi_{d,word_{i},k}$

Where $n_{d,word_{i}}$ is the number of occurrences of $word_{i}$ in document d.

Going Online

In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.

The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for $\phi$ and $\gamma:$

E-Step
initialize $\gamma$
repeat until change in $\phi < \epsilon$
update $\phi$
update $\gamma$

In the M-Step, we first compute $\lambda'$, which is the value of $\lambda$ if we imagined that the entire dataset is made up of $\frac{numDocs}{chunkSize}$ copies of the current chunk. Then $\lambda$ is updated using a weighted sum of $\lambda'$ and $\lambda:$

M-Step
compute $\lambda'$
update $\lambda$

$\phi_{iter,word_{i},k} \propto e^{E_{q}(log \theta_{iter,k})+E_{q}(log\beta_{k,word_{i}})}$

$\gamma_{iter,k} = \alpha + \sum_{word_{i}}\phi_{iter,word_{i},k}n_{iter,word_{i}}$

$\lambda'_{k,word_{i}}=\eta +batchSize*n_{iter,word_{i}}\phi_{iter,word_{i},k}$

$\lambda = (1-\rho_t)\lambda+\rho_t\lambda'$

Where $n_{iter,word_{i}}$ is the number of occurrences of $word_{i}$ in the current iteration’s chunk of documents, and $\rho_t$ is a weighting parameter.

We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated $\lambda$, we can find the most probable words for each topic by looking at the word probabilities in each row of $\lambda$.

Intuitions of the Inference

If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are $\gamma$ and $\lambda$. How do these get updated during training?

Updates of $\gamma$ (the topic compositions for each document) are the prior $\alpha$ plus a weighted sum of word counts. The word counts are weighted by $\phi_{word,topic}$, the probability of assigning the word to the topic. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.

Updates of $\lambda_{topic,word}$ (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset is likely to be assigned to topic 2, then it makes sense that $\lambda_{2, potato}$ should increase.

LDA Overflow

Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user generated content; I chose the stackoverflow.com-Posts.7z dataset.

The data arrives as a 27gb XML behemoth. The first step is isolating the text from the Title and Body fields for each row. These fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.

Since the file is so large, we need to incrementally read the XML. We also filter out non alpha-numeric characters. Details for this process can be found in xml_parse.py.

Once xml_parse.py runs, we get an 8gb text file containing around 8,000,000 stack overflow documents (title and body content). A couple examples:

Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table

Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary

LDA by Hoffman

We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application; to do so I created a wrapper for running LDA called online_lda.py. Use

python online_lda.py -h

to see the various command line arguments.

I’ve also added more comments to onlineldavb.py on the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.

Building a Vocabulary

The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows us the algorithm to associate an index of $\beta$ with a word ID and hence with a word.

We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words, and would probably not appear in the top words for a topic anyways since they appear so infrequently. Code for the vocabulary generation is found in dictionary.py.

Running

Let’s kick it off!

python online_lda.py dataset.txt vocabulary.txt

The training took ~12 hours for a 100 topic model on my MacBook. The values of $\gamma$ and $\lambda$ are output to files every 10,000 iterations and when the training completes. We can then use one of the $\lambda$ files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:

python printtopics.py vocabulary.txt lambda-final.dat 2

giving:

topic 0
suspended:0.8356
authorization:0.0215
entityset:0.0128
treemap:0.0094
professionals:0.0086
best:0.0084
facts:0.0072
special:0.0062
syntax:0.0056
listing:0.0051
forwarding:0.0049
webparts:0.0047
duration:0.0045
valued:0.0039
halts:0.0038
baggage:0.0034
yeah:0.0034
ltaspdropdownlistgt:0.0033
liable:0.0030

topic 1
support:0.7800
quarter:0.0380
fig:0.0278
luck:0.0160
1gb:0.0142
funeral:0.0124
visiting:0.0109
xiv:0.0071
screen:0.0063
commons:0.0046
monster:0.0040
flash:0.0039
faculty:0.0037
desire:0.0031
detached:0.0030
handler:0.0028
say:0.0028
everyday:0.0025
darker:0.0025
screen:0.0024

The numbers are the word-topic probabilities from $\lambda$.

We’ll use the approach from the first LDA post to create word clouds for two different topics:

Conclusion

We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!

Much of the material is derived from Hoffman’s paper and his online LDA implementation. Gensim and Vowpal Wabbit also have implementations of online LDA.

# These Are Your Tweets on LDA (Part II)

In the last post, I gave an overview of Latent Dirichlet Allocation (LDA), and walked through an application of LDA on @BarackObama’s tweets. The final product was a set of word clouds, one per topic, that showed the weighted words that defined the topic.

In this post, we’ll develop a dynamic visualization that incorporates multiple topics, allowing us to gain of a high level view of the topics and also drill down to see the words that define each topic. Through a simple web interface, we’ll also be able to view data from different twitter users.

As before, all of the code is available on GitHub. The visualization-related code is found in the viz/static directory.

Harnessing the Data

In the last post, we downloaded tweets for a user and found 50 topics that occur in the user’s tweets along with the top 20 words for each topic. We also found the composition of topics across all of the tweets, allowing us to rank the topics by prominence. For our visualization, we’ll choose to display the 10 highest ranked topics for a given twitter user name.

We need a visualization that can show multiple groupings of data. Each of the 10 groupings has 20 words, so we’d also like one that avoids the potential information overload. Finally, we’d like to incorporate the frequencies that we have for each word.

d3

A good fit for these requirements is d3.js‘s Zoomable Pack Layout, which gives us a high level view of each grouping as a bubble. Upon clicking a bubble, we can see the data that comprises the bubble, as well as each data point’s relative weight:

d3 Zoomable Pack Layout

In our case, each top-level bubble is a topic, and each inner bubble is a word, with its relative size determined by the word’s frequency.

Since the d3 visualization takes JSON as input, in order to plug in our LDA output data we simply create a toJSON() method in TopicModel.java that outputs the data associated with the top 10 topics to a JSON file. The ‘name’ of each topic is simply the most probably word in the topic.

Now, when the LDA process (the main() method in TopicModel.java) is run for a twitter user, the code will create a corresponding JSON file in viz/json. The JSON structure:

{
"name":"",
"children":[
{
"name": {topic_1_name},
"children":[
{
"name": {topic_1_word_1},
"size": {topic_1_word_1_freq}
},
{
"name": {topic_1_word_2},
"size": {topic_1_word_2_freq}
},
{
"name": {topic_1_word_3},
"size": {topic_1_word_3_freq}
},
....

Javascripting

Now, we make slight modifications to the javascript code embedded in the given d3 visualization. Our goal is to be able to toggle between results for different twitter users; we’d like to switch from investigating the @nytimes topics to getting a sense of what @KingJames tweets about.

To do so, we add a drop-down to index.html, such that each time a user is selected on the drop-down, their corresponding JSON is loaded by the show() function in viz.js. Hence we also change the show() function to reload the visualization each time it is called.

Making The Visualizations Visible

To run the code locally, navigate to the viz/static directory and start an HTTP server to serve the content, e.g.

cd {project_root}/viz/static
python -m SimpleHTTPServer

then navigate to http://localhost:8000/index.html to see the visualization.

By selecting nytimes, we see the following visualization which gives a sense of the topics:

Upon clicking the ‘gaza’ topic, we see the top words that comprise the topic:

I’ve also used Heroku to put an example of the finished visualization with data from 10 different twitter usernames here:

http://ldaviz.herokuapp.com/

Have fun exploring the various topics!