LDAOverflow with Online LDA

In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets, and visualized them. In this post, we’ll investigate using LDA on an 8 GB dataset of around 8 million Stack Overflow posts.

We’ll need to take a different approach; for the tweets we used a batch algorithm, which worked well for the relatively small dataset of around 5,000 tweets but scales poorly to massive datasets, since it holds the entire dataset in memory and makes multiple passes over it. The batch algorithm also assumes that we have the entire training set at the start of training, making the approach unviable for streaming data, which we may receive during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training if we receive more data without having to retrain on the original chunk.

In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.

First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.

Variations on Variational Bayes

For brevity this part will assume that you’ve read through the math background in the first LDA post. I’ll also only highlight major parts; for the full story check out Hoffman’s online LDA paper.

In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.

In short, we approximate the true distribution by a simple distribution q, and associate parameters \phi,\ \gamma,\ \lambda with the original parameters z,\ \theta,\ \beta respectively. Recall that z gives the topic assignments for each word in each document, \theta gives the topic composition of each document, and \beta gives the word-topic probabilities for each word and each topic.

Specifically, we have:

q(z_{di}=k) = \phi_{dw_{di}k}

q(\theta_{d}) = dirichlet(\theta_{d}; \gamma_{d})

q(\beta_{k}) = dirichlet(\beta_{k}; \lambda_{k})

Our goal is to estimate \phi,\ \gamma,\ \lambda. In both batch and online LDA, we alternate between two steps:

1. E-Step: Estimate \phi,\ \gamma using the current value of \lambda

2. M-Step: Update \lambda using the current value of \phi

The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.

Starting with Batch

In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:

for d = 1 to numDocs
    initialize \gamma
    repeat until change in \phi < \epsilon
        update \phi
        update \gamma

Then the M-Step updates \lambda using \phi values from every document:

update \lambda

The specific updates are:

\phi_{d,word_{i},k} \propto e^{E_{q}(log \theta_{d,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{d,k} = \alpha + \sum_{word_{i}}\phi_{d,word_{i},k}n_{d,word_{i}}

\lambda_{k,word_{i}}=\eta +\sum_{d}n_{d,word_{i}}\phi_{d,word_{i},k}

Where n_{d,word_{i}} is the number of occurrences of word_{i} in document d.
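To make these updates concrete, here’s a minimal NumPy sketch of the E-step for a single document. It relies on the standard Dirichlet identity E_{q}(log \theta_{d,k}) = \psi(\gamma_{d,k}) - \psi(\sum_{k}\gamma_{d,k}), where \psi is the digamma function; the digamma below is approximated numerically so the sketch needs only NumPy (Hoffman’s implementation uses scipy.special.psi):

```python
import numpy as np
from math import lgamma

def psi(x):
    """Elementwise digamma via a central difference of log-gamma
    (a dependency-free stand-in for scipy.special.psi)."""
    h = 1e-5
    return np.vectorize(lambda v: (lgamma(v + h) - lgamma(v - h)) / (2 * h))(x)

def e_step(word_ids, word_counts, lam, alpha, max_iters=100, tol=1e-3):
    """Variational E-step for a single document, following the updates above.

    word_ids:    vocabulary indices of the distinct words in the document
    word_counts: n_{d,w}, the count of each of those words
    lam:         current K x V topic-word parameter matrix (lambda)
    """
    K = lam.shape[0]
    # E_q[log beta_{k,w}] = psi(lambda_{k,w}) - psi(sum_w lambda_{k,w}),
    # restricted to this document's words
    Elogbeta = psi(lam[:, word_ids]) - psi(lam.sum(axis=1))[:, None]
    gamma = np.ones(K)  # initialize gamma
    for _ in range(max_iters):
        # E_q[log theta_{d,k}] = psi(gamma_k) - psi(sum_k gamma_k)
        Elogtheta = psi(gamma) - psi(gamma.sum())
        # phi_{w,k} proportional to exp(E[log theta_k] + E[log beta_{k,w}])
        log_phi = Elogtheta[None, :] + Elogbeta.T
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma update: alpha plus phi-weighted word counts
        new_gamma = alpha + phi.T @ word_counts
        converged = np.abs(new_gamma - gamma).mean() < tol
        gamma = new_gamma
        if converged:
            break
    return gamma, phi
```

The batch M-step then just accumulates each document’s result: \eta plus the sum of n_{d,word_{i}}\phi_{d,word_{i},k} over every document’s returned \phi.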

Going Online

In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.
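Assuming the dataset is stored one document per line (as ours will be), chunking reduces to buffering lines from the file; a minimal sketch:

```python
def iter_chunks(path, chunk_size=1000):
    """Yield lists of documents, chunk_size at a time, so the full
    dataset never has to sit in memory (assumes one document per line)."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # final, possibly smaller, chunk
        yield chunk
```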

The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for \phi and \gamma:

initialize \gamma
repeat until change in \phi < \epsilon
    update \phi
    update \gamma

In the M-Step, we first compute \lambda', which is the value of \lambda if we imagined that the entire dataset is made up of \frac{numDocs}{chunkSize} copies of the current chunk. Then \lambda is updated using a weighted sum of \lambda' and \lambda:

compute \lambda'
update \lambda

The specific updates are:

\phi_{iter,word_{i},k} \propto e^{E_{q}(log \theta_{iter,k})+E_{q}(log\beta_{k,word_{i}})}

\gamma_{iter,k} = \alpha + \sum_{word_{i}}\phi_{iter,word_{i},k}n_{iter,word_{i}}

\lambda'_{k,word_{i}}=\eta +\frac{numDocs}{chunkSize}n_{iter,word_{i}}\phi_{iter,word_{i},k}

\lambda = (1-\rho_t)\lambda+\rho_t\lambda'

Where n_{iter,word_{i}} is the number of occurrences of word_{i} in the current iteration’s chunk of documents, and \rho_t is a weighting parameter that controls how much the new chunk’s estimate overrides the old value of \lambda.
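The blending step itself is a single line. In Hoffman’s paper the weight decays as \rho_t = (\tau_0 + t)^{-\kappa}, with \kappa \in (0.5, 1] required for convergence; a sketch:

```python
import numpy as np

def online_m_step(lam, lam_chunk, t, tau0=1.0, kappa=0.7):
    """Blend the chunk estimate lambda' (lam_chunk) into the running lambda.

    rho_t decays with the iteration number t, so early chunks move
    lambda a lot and later chunks only fine-tune it.
    """
    rho = (tau0 + t) ** (-kappa)
    return (1 - rho) * lam + rho * lam_chunk
```

With \tau_0 = 1 and t = 0, \rho_t = 1, so the very first chunk completely replaces the initial \lambda; later chunks nudge it less and less.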

We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated \lambda, we can find the most probable words for each topic by looking at the word probabilities in each row of \lambda.

Intuitions of the Inference

If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are \gamma and \lambda. How do these get updated during training?

Updates of \gamma (the topic compositions for each document) are the prior \alpha plus a weighted sum of word counts. The word counts are weighted by \phi_{word,topic}, the probability of assigning the word to the topic. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.

Updates of \lambda_{topic,word} (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset and is likely to be assigned to topic 2, then it makes sense that \lambda_{2, potato} should increase.

LDA Overflow

Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user-generated content; I chose the stackoverflow.com-Posts.7z dataset.

Read, Clean, Parse, Repeat

The data arrives as a 27 GB XML behemoth. The first step is isolating the text from the Title and Body fields of each row. Together, these fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.

Since the file is so large, we need to read the XML incrementally. We also filter out non-alphanumeric characters. Details for this process can be found in xml_parse.py.
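The incremental read can be sketched with the standard library’s xml.etree.ElementTree.iterparse. This sketch assumes each post is stored as a <row> element with Title and Body attributes, as in the Stack Exchange dumps (xml_parse.py handles more details):

```python
import re
import xml.etree.ElementTree as ET

def iter_documents(xml_path):
    """Stream documents out of the Posts dump without loading the whole
    file; assumes each post is a <row> element with Title/Body attributes."""
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == 'row':
            text = elem.get('Title', '') + ' ' + elem.get('Body', '')
            # strip HTML tags from the body, then anything non-alphanumeric
            text = re.sub(r'<[^>]+>', ' ', text)
            text = re.sub(r'[^0-9a-zA-Z]+', ' ', text)
            yield text.strip()
        elem.clear()  # free the element as we go
```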

Once xml_parse.py runs, we get an 8 GB text file containing around 8,000,000 Stack Overflow documents (title and body content). A couple of examples:

Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table

Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary

LDA by Hoffman

We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application; to do so I created a wrapper for running LDA called online_lda.py. Use

python online_lda.py -h

to see the various command line arguments.

I’ve also added more comments to onlineldavb.py on the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.

Building a Vocabulary

The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows the algorithm to associate each index of \beta with a word ID, and hence with a word.

We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words, and would probably not appear among the top words for a topic anyway, since they appear so infrequently. Code for the vocabulary generation is found in dictionary.py.
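As a dependency-free sketch of the document-frequency filtering (gensim’s Dictionary offers the same no_below filtering via its filter_extremes method):

```python
from collections import Counter

def build_vocabulary(documents, min_docs=10):
    """Build a word -> id mapping, dropping words that appear in fewer
    than min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        # count each word at most once per document (document frequency)
        doc_freq.update(set(doc.lower().split()))
    vocab = sorted(w for w, df in doc_freq.items() if df >= min_docs)
    return {w: i for i, w in enumerate(vocab)}
```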


Let’s kick it off!

python online_lda.py dataset.txt vocabulary.txt

Training took ~12 hours for a 100-topic model on my MacBook. The values of \gamma and \lambda are written to files every 10,000 iterations and when training completes. We can then use one of the \lambda files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:

python printtopics.py vocabulary.txt lambda-final.dat 2


topic 0
(top 20 words and their probabilities)

topic 1
(top 20 words and their probabilities)

The numbers are the word-topic probabilities from \lambda.
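Pulling the top words out of a \lambda matrix is just a sort over each row after normalizing it into probabilities; a sketch along the lines of what printtopics.py does (top_words is a hypothetical helper, not part of Hoffman’s code):

```python
import numpy as np

def top_words(lam, vocab, n=20):
    """For each topic (row of lambda), normalize the row into word
    probabilities and return the n most probable (word, prob) pairs."""
    id2word = {i: w for w, i in vocab.items()}
    topics = []
    for row in lam:
        probs = row / row.sum()
        best = np.argsort(probs)[::-1][:n]
        topics.append([(id2word[i], probs[i]) for i in best])
    return topics
```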

We’ll use the approach from the first LDA post to create word clouds for two different topics:



We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!

Credits & Links

Much of the material is derived from Hoffman’s paper and his online LDA implementation. Gensim and Vowpal Wabbit also have implementations of online LDA.