LDAOverflow with Online LDA

In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets, and visualized them. In this post, we’ll investigate using LDA on an 8 GB dataset of around 8 million Stack Overflow posts.

We’ll need to take a different approach; for the tweets we used a batch algorithm that worked well for the relatively small dataset of around 5000 tweets, but would likely introduce performance issues when running on massive datasets. The batch algorithm also assumed that we have the entire training set at the start of training, making the approach unviable for streaming data, which we may receive during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training if we receive more data without having to retrain on the original chunk.

In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.

First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a Python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.

Variations on Variational Bayes

For brevity this part will assume that you’ve read through the math background in the first LDA post. I’ll also only highlight major parts; for the full story check out Hoffman’s online LDA paper.

In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.

In short, we approximate the true distribution by a simple distribution $q$, and associate parameters $\phi,\ \gamma,\ \lambda$ with the original parameters $z,\ \theta,\ \beta$ respectively. Recall that $z$ gives the topic assignments for each word in each document, $\theta$ gives the topic composition of each document, and $\beta$ gives the word-topic probabilities for each word and each topic.

Specifically, we have:

$q(z_{di}=k) = \phi_{dw_{di}k}$

$q(\theta_{d}) = dirichlet(\theta_{d}; \gamma_{d})$

$q(\beta_{k}) = dirichlet(\beta_{k}; \lambda_{k})$

Our goal is to estimate $\phi,\ \gamma,\ \lambda$. In both batch and online LDA, we alternate between two steps:

1. E-Step: Estimate $\phi,\ \gamma$ using the current value of $\lambda$

2. M-Step: Update $\lambda$ , using the current value of $\phi$

The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.

Starting with Batch

In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:

E-Step
for d = 1 to numDocs
    initialize $\gamma$
    repeat until change in $\phi < \epsilon$
        update $\phi$
        update $\gamma$

Then the M-Step updates $\lambda$ using $\phi$ values from every document:

M-Step
update $\lambda$

The update equations for the two steps are:

$\phi_{d,word_{i},k} \propto e^{E_{q}(\log \theta_{d,k})+E_{q}(\log\beta_{k,word_{i}})}$

$\gamma_{d,k} = \alpha + \sum_{word_{i}}\phi_{d,word_{i},k}n_{d,word_{i}}$

$\lambda_{k,word_{i}}=\eta +\sum_{d}n_{d,word_{i}}\phi_{d,word_{i},k}$

Where $n_{d,word_{i}}$ is the number of occurrences of $word_{i}$ in document d.
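To make the updates concrete, here’s a rough NumPy sketch of one batch E-step and M-step pass, assuming a dense document-word count matrix (the function and variable names are my own, not from Hoffman’s code). For a Dirichlet-distributed variable, the expectations $E_{q}(\log \theta)$ and $E_{q}(\log \beta)$ work out to digamma differences, which is what `e_log_dirichlet` computes.

```python
import numpy as np
from scipy.special import digamma

def e_log_dirichlet(param):
    """E_q[log x] for x ~ Dirichlet(param); works on 1-D or 2-D arrays."""
    return digamma(param) - digamma(param.sum(axis=-1, keepdims=True))

def batch_e_step(n, lam, alpha, eps=1e-3, max_iter=100):
    """E-step over the whole dataset.
    n: (numDocs x vocabSize) word-count matrix
    lam: current (numTopics x vocabSize) lambda"""
    num_docs, vocab_size = n.shape
    num_topics = lam.shape[0]
    exp_elog_beta = np.exp(e_log_dirichlet(lam))            # K x V
    gamma = np.ones((num_docs, num_topics))                 # initialize gamma
    phi = np.zeros((num_docs, vocab_size, num_topics))
    for d in range(num_docs):
        for _ in range(max_iter):
            last_gamma = gamma[d].copy()
            # phi_{d,w,k} proportional to exp(E[log theta_dk] + E[log beta_kw])
            p = np.exp(e_log_dirichlet(gamma[d]))[None, :] * exp_elog_beta.T  # V x K
            p /= p.sum(axis=1, keepdims=True)
            # gamma_{d,k} = alpha + sum_w n_{d,w} phi_{d,w,k}
            gamma[d] = alpha + n[d] @ p
            phi[d] = p
            # convergence is checked on the change in gamma
            if np.abs(gamma[d] - last_gamma).mean() < eps:
                break
    return phi, gamma

def batch_m_step(n, phi, eta):
    # lambda_{k,w} = eta + sum_d n_{d,w} phi_{d,w,k}
    return eta + np.einsum('dw,dwk->kw', n, phi)
```

Note how the M-step needs $\phi$ values from every document, which is exactly why the batch algorithm must hold the whole dataset.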

Going Online

In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.
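Reading the corpus chunk-by-chunk is just a matter of buffering lines; a minimal sketch, assuming one document per line of the dataset file (`iter_chunks` is a hypothetical helper, not part of Hoffman’s code):

```python
def iter_chunks(path, chunk_size=1000):
    """Yield lists of up to chunk_size documents, one document per line."""
    chunk = []
    with open(path) as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # don't drop the final, partial chunk
        yield chunk
```

Since this is a generator, only one chunk is ever materialized in memory at a time.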

The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for $\phi$ and $\gamma:$

E-Step
initialize $\gamma$
repeat until change in $\phi < \epsilon$
    update $\phi$
    update $\gamma$

In the M-Step, we first compute $\lambda'$, which is the value of $\lambda$ if we imagined that the entire dataset is made up of $\frac{numDocs}{chunkSize}$ copies of the current chunk. Then $\lambda$ is updated using a weighted sum of $\lambda'$ and $\lambda:$

M-Step
compute $\lambda'$
update $\lambda$

The update equations are:

$\phi_{iter,word_{i},k} \propto e^{E_{q}(\log \theta_{iter,k})+E_{q}(\log\beta_{k,word_{i}})}$

$\gamma_{iter,k} = \alpha + \sum_{word_{i}}\phi_{iter,word_{i},k}n_{iter,word_{i}}$

$\lambda'_{k,word_{i}}=\eta +\frac{numDocs}{chunkSize}\,n_{iter,word_{i}}\phi_{iter,word_{i},k}$

$\lambda = (1-\rho_t)\lambda+\rho_t\lambda'$

Where $n_{iter,word_{i}}$ is the number of occurrences of $word_{i}$ in the current iteration’s chunk of documents, and $\rho_t$ is a weighting parameter that decays as training progresses, so later chunks perturb $\lambda$ less and less.
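As a sketch, the online M-step boils down to a few lines of NumPy. The step-size schedule $\rho_t = (\tau_0 + t)^{-\kappa}$ is the one from Hoffman’s paper; the function and argument names (and the default values of $\tau_0$ and $\kappa$) are my own choices here:

```python
import numpy as np

def online_lambda_update(lam, n_chunk, phi_chunk, eta, num_docs, t,
                         tau0=1024.0, kappa=0.7):
    """One online M-step.
    lam: current (numTopics x vocabSize) lambda
    n_chunk: (chunkSize x vocabSize) counts for the current chunk
    phi_chunk: (chunkSize x vocabSize x numTopics) local phi values
    t: number of updates performed so far"""
    chunk_size = n_chunk.shape[0]
    # lambda' pretends the corpus is numDocs/chunkSize copies of this chunk
    lam_prime = eta + (num_docs / chunk_size) * np.einsum(
        'dw,dwk->kw', n_chunk, phi_chunk)
    rho = (tau0 + t) ** (-kappa)   # step size: decays as t grows
    return (1.0 - rho) * lam + rho * lam_prime
```

With $\kappa$ in $(0.5, 1]$ the step sizes shrink fast enough that the updates converge, but slowly enough that every chunk still contributes.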

We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated $\lambda$, we can find the most probable words for each topic by looking at the word probabilities in each row of $\lambda$.

Intuitions of the Inference

If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are $\gamma$ and $\lambda$. How do these get updated during training?

Updates of $\gamma$ (the topic compositions for each document) are the prior $\alpha$ plus a weighted sum of word counts. The word counts are weighted by $\phi_{word,topic}$, the probability of assigning the word to the topic. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.

Updates of $\lambda_{topic,word}$ (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset and is likely to be assigned to topic 2, then it makes sense that $\lambda_{2, potato}$ should increase.
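As a toy example with made-up numbers: suppose $\alpha = 0.1$, “potato” occurs 5 times in a document, and $\phi$ assigns “potato” to topic 2 with probability 0.9. The $\gamma$ contribution for topic 2 from that single word is then:

```python
alpha = 0.1               # Dirichlet prior on topic compositions
n_potato = 5              # "potato" appears 5 times in this document
phi_potato_topic2 = 0.9   # probability "potato" is assigned to topic 2

# gamma_{d,2} contribution from this one word: 0.1 + 0.9 * 5 = 4.6
gamma_d2 = alpha + phi_potato_topic2 * n_potato
```

So a handful of confidently-assigned word occurrences is enough to pull a document strongly toward a topic.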

LDA Overflow

Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user-generated content; I chose the stackoverflow.com-Posts.7z dataset.

The data arrives as a 27 GB XML behemoth. The first step is isolating the text from the Title and Body fields for each row. These fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.

Since the file is so large, we need to read the XML incrementally. We also filter out non-alphanumeric characters. Details for this process can be found in xml_parse.py.
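The incremental read can be done with `xml.etree.ElementTree.iterparse`, which processes elements as they stream past instead of loading the whole file. The sketch below is a simplified stand-in for xml_parse.py, not the script itself; note that the naive regex leaves HTML tag names like “p” behind as tokens, so a real pipeline would strip HTML from the Body field first.

```python
import re
import xml.etree.ElementTree as ET

def iter_documents(xml_path):
    """Stream <row> elements from the Posts XML dump, yielding one
    cleaned 'Title + Body' document at a time."""
    non_alnum = re.compile(r'[^A-Za-z0-9 ]+')
    for _, elem in ET.iterparse(xml_path, events=('end',)):
        if elem.tag == 'row':
            text = (elem.get('Title', '') or '') + ' ' + (elem.get('Body', '') or '')
            yield non_alnum.sub(' ', text).strip()
            elem.clear()  # release the element so memory stays bounded
```

The `elem.clear()` call is what keeps memory flat: without it, the parsed tree keeps growing as we move through the 27 GB file.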

Once xml_parse.py runs, we get an 8 GB text file containing around 8,000,000 Stack Overflow documents (title and body content). A couple examples:

Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table

Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary

LDA by Hoffman

We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application; to do so I created a wrapper for running LDA called online_lda.py. Use

python online_lda.py -h

to see the various command line arguments.

I’ve also added more comments to onlineldavb.py on the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.

Building a Vocabulary

The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows the algorithm to associate an index of $\beta$ with a word ID, and hence with a word.

We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words, and would be unlikely to appear in the top words for a topic anyway given how infrequent they are. Code for the vocabulary generation is found in dictionary.py.
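In spirit, the document-frequency filter looks like this (a stdlib sketch using a plain Counter; gensim’s `Dictionary.filter_extremes(no_below=10)` does the same job more conveniently, and is what dictionary.py relies on):

```python
from collections import Counter

def build_vocabulary(docs, min_doc_freq=10):
    """docs: iterable of token lists. Keep words that appear in at
    least min_doc_freq distinct documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count documents, not occurrences
    return sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
```

The `set(tokens)` step is the important detail: a word repeated fifty times in one post still only counts as one document.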

Running

Let’s kick it off!

python online_lda.py dataset.txt vocabulary.txt

Training took ~12 hours for a 100-topic model on my MacBook. The values of $\gamma$ and $\lambda$ are output to files every 10,000 iterations and when the training completes. We can then use one of the $\lambda$ files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:

python printtopics.py vocabulary.txt lambda-final.dat 2

giving:

topic 0
suspended:0.8356
authorization:0.0215
entityset:0.0128
treemap:0.0094
professionals:0.0086
best:0.0084
facts:0.0072
special:0.0062
syntax:0.0056
listing:0.0051
forwarding:0.0049
webparts:0.0047
duration:0.0045
valued:0.0039
halts:0.0038
baggage:0.0034
yeah:0.0034
ltaspdropdownlistgt:0.0033
liable:0.0030

topic 1
support:0.7800
quarter:0.0380
fig:0.0278
luck:0.0160
1gb:0.0142
funeral:0.0124
visiting:0.0109
xiv:0.0071
screen:0.0063
commons:0.0046
monster:0.0040
flash:0.0039
faculty:0.0037
desire:0.0031
detached:0.0030
handler:0.0028
say:0.0028
everyday:0.0025
darker:0.0025
screen:0.0024

The numbers are the word-topic probabilities from $\lambda$.
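This is essentially what printtopics.py does: normalize a row of $\lambda$ into a word distribution and sort it. A sketch, assuming $\lambda$ has been loaded as a numTopics × vocabSize NumPy array (the function name is mine):

```python
import numpy as np

def top_words(lam, vocab, topic, n=20):
    """Return the n most probable (word, probability) pairs for a topic.
    lam: (numTopics x vocabSize) array; vocab: list of words."""
    probs = lam[topic] / lam[topic].sum()  # normalize the row to probabilities
    order = np.argsort(-probs)[:n]         # indices of the largest entries
    return [(vocab[i], float(probs[i])) for i in order]
```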

We’ll use the approach from the first LDA post to create word clouds for two different topics.

We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!