In the past two posts (part I and part II), we used Latent Dirichlet Allocation (LDA) to discover topics for tweets and visualized them. In this post, we’ll investigate using LDA on an 8 GB dataset of around 8 million Stack Overflow posts.
We’ll need to take a different approach. For the tweets we used a batch algorithm that worked well for the relatively small dataset of around 5,000 tweets, but it would likely run into performance problems on massive datasets. The batch algorithm also assumes that we have the entire training set at the start of training, which makes it unviable for streaming data that arrives during training. It’d be nice to train the model incrementally, so that we can train on a chunk of data, then resume training when we receive more data without having to retrain on the original chunk.
In this post, we’ll look at Online LDA, a variation of ‘vanilla’ LDA that can be trained incrementally in small batches. Online LDA is a good choice for large datasets since we only need to hold a very small subset of the dataset in memory at a given time, and a good fit for streaming data since we can continually feed in new data batches as we receive them. We’re also able to save the model state at a point in training, then resume later when we want to train on more data.
First, we’ll jump into the math and look at the differences between online and batch LDA. Then we’ll use a Python implementation of online LDA to discover topics for the Stack Overflow dataset. As usual, all of the associated code is available on GitHub.
Variations on Variational Bayes
In LDA, our ultimate goal is to find the posterior distribution of latent topic variables after observing training data. However, computing this distribution is intractable, so we’re forced to approximate. One approximation approach is to use an optimization method called Variational Bayes.
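In short, we approximate the true posterior $p(z, \theta, \beta \mid w, \alpha, \eta)$ by a simple distribution $q(z, \theta, \beta \mid \phi, \gamma, \lambda)$, and associate the parameters $\phi, \gamma, \lambda$ with the original variables $z, \theta, \beta$ respectively. Recall that $z$ gives the topic assignments for each word in each document, $\theta$ gives the topic composition of each document, and $\beta$ gives the word-topic probabilities for each word and each topic.
Specifically, we have:
$$q(z_{di} = k) = \phi_{d w_{di} k}, \qquad q(\theta_d) = \text{Dirichlet}(\theta_d; \gamma_d), \qquad q(\beta_k) = \text{Dirichlet}(\beta_k; \lambda_k)$$
Our goal is to estimate $\phi, \gamma, \lambda$. In both batch and online LDA, we alternate between two steps:
1. E-Step: Estimate $\phi, \gamma$ using the current value of $\lambda$
2. M-Step: Update $\lambda$, using the current value of $\phi$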
The core difference between batch and online LDA is in how these steps are carried out at the algorithmic level.
Starting with Batch
In batch Variational Bayes, we perform multiple passes over the entire dataset, checking each time for convergence. During each pass, the algorithm does an E-Step using the entire dataset. At a high level:
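E-Step
    for d = 1 to numDocs
        initialize $\gamma_d$
        repeat until change in $\gamma_d$ < $\epsilon$
            update $\phi_d$
            update $\gamma_d$
Then the M-Step updates $\lambda$ using the $\phi$ values from every document:
M-Step
    update $\lambda$
The specific updates are:
$$\phi_{dwk} \propto \exp\{E_q[\log \theta_{dk}] + E_q[\log \beta_{kw}]\}$$
$$\gamma_{dk} = \alpha + \sum_w n_{dw} \phi_{dwk}$$
$$\lambda_{kw} = \eta + \sum_d n_{dw} \phi_{dwk}$$
Where $n_{dw}$ is the number of occurrences of word $w$ in document $d$.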
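To make the batch updates concrete, here is a minimal pure-Python sketch of one batch pass. It is an illustration under stated assumptions, not the code used later in this post: documents are assumed to be dicts mapping word ID to count, the names `e_step_doc` and `batch_pass` are hypothetical, and the toy `docs`, `alpha`, and `eta` values are made up. It relies on the standard Dirichlet identity $E_q[\log \theta_{dk}] = \psi(\gamma_{dk}) - \psi(\sum_j \gamma_{dj})$ (and likewise for $\beta$ with $\lambda$), where $\psi$ is the digamma function, approximated here with a short series so that the block needs only the standard library.

```python
import math

def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def exp_elog_dirichlet(params):
    """exp(E[log x_i]) for x ~ Dirichlet(params)."""
    total = digamma(sum(params))
    return [math.exp(digamma(p) - total) for p in params]

def e_step_doc(doc, exp_elog_beta, alpha, tol=1e-5, max_iter=100):
    """doc maps word ID -> count. Returns (gamma_d, phi_d) for one document."""
    num_topics = len(exp_elog_beta)
    gamma = [1.0] * num_topics  # initialize gamma_d
    phi = {}
    for _ in range(max_iter):
        exp_elog_theta = exp_elog_dirichlet(gamma)
        for w in doc:  # update phi_d: one normalized row per distinct word
            weights = [exp_elog_theta[k] * exp_elog_beta[k][w]
                       for k in range(num_topics)]
            norm = sum(weights)
            phi[w] = [v / norm for v in weights]
        # update gamma_d: prior plus phi-weighted word counts
        new_gamma = [alpha + sum(n * phi[w][k] for w, n in doc.items())
                     for k in range(num_topics)]
        change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / num_topics
        gamma = new_gamma
        if change < tol:
            break
    return gamma, phi

def batch_pass(docs, lam, alpha, eta):
    """One full pass: E-step on every document, then the M-step for lambda."""
    num_topics, vocab_size = len(lam), len(lam[0])
    exp_elog_beta = [exp_elog_dirichlet(row) for row in lam]
    new_lam = [[eta] * vocab_size for _ in range(num_topics)]
    gammas = []
    for doc in docs:
        gamma, phi = e_step_doc(doc, exp_elog_beta, alpha)
        gammas.append(gamma)
        for w, n in doc.items():
            for k in range(num_topics):
                new_lam[k][w] += n * phi[w][k]
    return gammas, new_lam

# Toy data: 2 documents, 4 vocabulary words, 2 topics (all values made up).
docs = [{0: 5, 1: 1}, {2: 4, 3: 2}]
lam = [[2.0, 1.0, 0.5, 0.5], [0.5, 0.5, 1.0, 2.0]]
for _ in range(20):
    gammas, lam = batch_pass(docs, lam, alpha=0.1, eta=0.01)
```

Note that every pass touches every document before $\lambda$ changes at all, which is exactly the cost that becomes painful on a large dataset.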
In online Variational Bayes, we only make a single sweep of the entire dataset, analyzing a chunk of documents at a time. A ‘chunk’ could be a single document, 42 documents, or even the entire dataset. Let’s let a ‘chunk’ be 1000 documents.
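Reading the data one chunk at a time might look something like the sketch below. The helper name `iter_chunks` is hypothetical, and it assumes the one-document-per-line text format used later in this post.

```python
from itertools import islice

def iter_chunks(lines, chunk_size=1000):
    """Yield lists of up to chunk_size stripped, non-empty lines."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        chunk = list(islice(non_empty, chunk_size))
        if not chunk:
            return
        yield chunk

# Small in-memory demonstration; a file object works the same way,
# e.g. `with open(path) as f: for chunk in iter_chunks(f): ...`
demo = list(iter_chunks(["first doc\n", "second doc\n", "\n", "third doc\n"], 2))
```

Because the generator only ever materializes one chunk, memory use depends on the chunk size rather than on the size of the file.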
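The online E-Step only uses the current chunk; instead of 8 million posts we now only have to hold 1000 in memory. The E-Step finds locally optimal values for $\phi$ and $\gamma$:
E-Step
    initialize $\gamma$
    repeat until change in $\gamma$ < $\epsilon$
        update $\phi$
        update $\gamma$
In the M-Step, we first compute $\tilde{\lambda}$, which is the value of $\lambda$ if we imagined that the entire dataset is made up of $D/S$ copies of the current chunk, where $D$ is the total number of documents and $S$ is the chunk size. Then $\lambda$ is updated using a weighted sum of $\tilde{\lambda}$ and the previous $\lambda$:
M-Step
    compute $\tilde{\lambda}$
    update $\lambda$
The specific updates are:
$$\phi_{twk} \propto \exp\{E_q[\log \theta_{tk}] + E_q[\log \beta_{kw}]\}$$
$$\gamma_{tk} = \alpha + \sum_w n_{tw} \phi_{twk}$$
$$\tilde{\lambda}_{kw} = \eta + \frac{D}{S} n_{tw} \phi_{twk}$$
$$\lambda \leftarrow (1 - \rho_t) \lambda + \rho_t \tilde{\lambda}$$
Where $n_{tw}$ is the number of occurrences of word $w$ in the current iteration’s chunk of documents, and $\rho_t$ is a weighting parameter.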
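The online M-Step can be sketched in a few lines. This is an illustration, not the implementation used below: `online_m_step` and `chunk_stats` are hypothetical names, where `chunk_stats[k][w]` is assumed to hold the chunk's word counts for word `w` weighted by the E-Step's probability of assigning it to topic `k`. The weighting schedule $\rho_t = (\tau_0 + t)^{-\kappa}$ is the one proposed in the online LDA paper by Hoffman et al.; the default values of `tau0` and `kappa` here are assumptions chosen for the example.

```python
def rho(t, tau0=1.0, kappa=0.7):
    """Weight given to the newest chunk at iteration t; shrinks as t grows."""
    return (tau0 + t) ** (-kappa)

def online_m_step(lam, chunk_stats, t, eta, num_docs, chunk_size,
                  tau0=1.0, kappa=0.7):
    """Blend the old lambda with lambda-tilde estimated from one chunk."""
    scale = num_docs / chunk_size  # pretend the corpus is copies of this chunk
    weight = rho(t, tau0, kappa)
    new_lam = []
    for lam_row, stats_row in zip(lam, chunk_stats):
        new_row = []
        for lam_kw, s_kw in zip(lam_row, stats_row):
            lam_tilde = eta + scale * s_kw
            new_row.append((1 - weight) * lam_kw + weight * lam_tilde)
        new_lam.append(new_row)
    return new_lam

# One topic, two words, made-up numbers. At t=0 with tau0=1 the weight is 1,
# so lambda is replaced outright by lambda-tilde.
updated = online_m_step([[1.0, 1.0]], [[2.0, 0.0]], t=0, eta=0.1,
                        num_docs=10, chunk_size=5)
```

Since the weight decays over time, early chunks move $\lambda$ a lot and later chunks only nudge it.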
We can see that unlike batch LDA, in online LDA we only need to hold a small chunk of the data at a time, and once we’re done analyzing it, we never need it again. As with batch, once we’ve estimated $\lambda$, we can find the most probable words for each topic by looking at the word probabilities in each row of $\lambda$.
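Reading the top words off of a topic-by-word matrix could look like the following sketch. The function name `top_words` and the tiny matrix are hypothetical; each row is normalized to sum to 1 so the values can be read as probabilities.

```python
def top_words(lam, vocab, n=20):
    """For each topic row, return the n most probable (word, probability) pairs."""
    topics = []
    for row in lam:
        total = sum(row)
        ranked = sorted(range(len(row)), key=lambda w: row[w], reverse=True)[:n]
        topics.append([(vocab[w], row[w] / total) for w in ranked])
    return topics

# Made-up 2-topic, 3-word example.
vocab = ["potato", "python", "query"]
lam = [[1.0, 3.0, 6.0], [5.0, 4.0, 1.0]]
for k, words in enumerate(top_words(lam, vocab, n=2)):
    print("topic", k, " ".join("%s:%.4f" % pair for pair in words))
```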
Intuitions of the Inference
If we squint and step back, LDA consists of using simple word counts in a clever way. The two parameters we ultimately care about are $\gamma$ and $\lambda$. How do these get updated during training?
Updates of $\gamma$ (the topic compositions for each document) are the prior $\alpha$ plus a weighted sum of word counts. The word counts are weighted by $\phi_{dwk}$, the probability of assigning the word $w$ to the topic $k$. Intuitively, if we count a lot of instances of “potato” in a document, and “potato” is likely to be assigned to topic 2, then it makes sense that the document has more of topic 2 in it than we previously thought.
Updates of $\lambda$ (the word-topic probabilities) use word counts weighted by the probability that the word will be assigned to the given topic. If “potato” shows up a lot in the dataset and is likely to be assigned to topic 2, then it makes sense that $\lambda_{2,\text{potato}}$ should increase.
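As a toy illustration with made-up numbers (assumptions for the example, not values from a trained model): take a prior of $\alpha = 0.1$ and a document in which “potato” appears 5 times, with a probability of 0.8 of assigning “potato” to topic 2 and 0.1 of assigning it to topic 1. Ignoring the document’s other words, the “potato” counts contribute:

$$\gamma_{d2} = 0.1 + 5 \cdot 0.8 = 4.1, \qquad \gamma_{d1} = 0.1 + 5 \cdot 0.1 = 0.6$$

Topic 2 ends up with far more of the document’s weight than topic 1, purely because of how often “potato” was counted and how strongly it is tied to topic 2.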
Now it’s time to run Online LDA on the Stack Overflow dataset to discover topics without overflowing our memory. Stack Exchange kindly provides (and updates) a data dump of all of its user generated content; I chose the stackoverflow.com-Posts.7z dataset.
Read, Clean, Parse, Repeat
The data arrives as a 27 GB XML behemoth. The first step is isolating the text from the Title and Body fields for each row. These fields will comprise a ‘document’, and our dataset will be formatted as a text file with one document per line.
Since the file is so large, we need to read the XML incrementally. We also filter out non-alphanumeric characters. Details for this process can be found in xml_parse.py.
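For a sense of what incremental parsing involves, here is a minimal sketch using the standard library’s `xml.etree.ElementTree.iterparse`. It is not the code in xml_parse.py. It assumes each post is a `row` element carrying `Title` and `Body` attributes, and the HTML tag stripping is an extra step assumed for this example; the function names and the sample XML are made up.

```python
import io
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")               # HTML tags inside the post body
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9]+")   # runs of non-alphanumeric chars

def clean(text):
    """Drop HTML tags, then collapse non-alphanumeric runs into single spaces."""
    text = TAG_RE.sub(" ", text)
    return NON_ALNUM_RE.sub(" ", text).strip()

def iter_documents(source):
    """Yield one cleaned 'Title Body' string per row without loading the file."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            doc = clean(elem.get("Title", "") + " " + elem.get("Body", ""))
            if doc:
                yield doc
            root.clear()  # release rows that were already processed

# In-memory sample standing in for the real dump (contents are made up).
SAMPLE = (b'<posts>'
          b'<row Id="1" Title="Throw an error" Body="&lt;p&gt;How can I?&lt;/p&gt;" />'
          b'<row Id="2" Body="Answer: yes!" />'
          b'</posts>')
documents = list(iter_documents(io.BytesIO(SAMPLE)))
```

Clearing the root after each row is what keeps memory flat; without it, the parser would keep every processed row attached to the tree.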
Once xml_parse.py runs, we get an 8 GB text file containing around 8,000,000 Stack Overflow documents (title and body content). A couple of examples:
Throw an error in a MySQL trigger If I have a trigger before the update on a table how can I throw an error that prevents the update on that table
Compressing Decompressing Folders Files Does anyone know of a good way to compress or decompress files and folders in C quickly Handling large files might be necessary
LDA by Hoffman
We’ll use a Python implementation of online LDA written by Matt Hoffman, available on his webpage. We need to adapt the high-level running script for our application; to do so I created a wrapper for running LDA called online_lda.py. Use
python online_lda.py -h
to see the various command line arguments.
I’ve also added more comments to onlineldavb.py on the repo in case you’d like to further inspect how the actual Online LDA algorithm is implemented.
Building a Vocabulary
The LDA implementation assumes that we have a vocabulary file prior to training so that it can compactly represent documents as numeric word IDs. The vocabulary also allows the algorithm to associate a column index of $\lambda$ with a word ID, and hence with a word.
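As a rough sketch of what this compact representation looks like (the helper names are hypothetical and this is not the parsing code from the LDA implementation), a document can be reduced to a list of word IDs and a parallel list of counts, with out-of-vocabulary words dropped:

```python
from collections import Counter

def load_vocab(words):
    """Map each vocabulary word to its position, which serves as its word ID."""
    return {word.lower(): idx for idx, word in enumerate(words)}

def doc_to_counts(doc, vocab):
    """Return (word_ids, counts) for the in-vocabulary words of one document."""
    counts = Counter(vocab[tok] for tok in doc.lower().split() if tok in vocab)
    word_ids = sorted(counts)
    return word_ids, [counts[i] for i in word_ids]

vocab = load_vocab(["mysql", "trigger", "error"])
word_ids, counts = doc_to_counts("Throw an error in a MySQL trigger trigger", vocab)
```

Here `word_ids` is `[0, 1, 2]` and `counts` is `[1, 2, 1]`: “trigger” appears twice, and words such as “throw” are ignored because they are not in the vocabulary.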
We can generate a domain-specific vocabulary using the first 100,000 Stack Overflow posts, and supplement it with the vocabulary provided by Hoffman, which contains the most frequent English words. Gensim has a nice library for creating vocabularies. We filter out words that appear in fewer than 10 documents, since they are often ‘junk’ words and would probably not appear in the top words for a topic anyway, given how infrequently they occur. Code for the vocabulary generation is found in dictionary.py.
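The document-frequency filter is simple enough to sketch with the standard library. Note that this swaps in a plain `Counter` for Gensim and is not the code in dictionary.py; `build_vocab` and its parameters are hypothetical names.

```python
from collections import Counter

def build_vocab(docs, min_doc_freq=10, extra_words=()):
    """Keep words found in at least min_doc_freq documents, plus extra_words."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a word counts once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    kept = {word for word, freq in doc_freq.items() if freq >= min_doc_freq}
    return sorted(kept | {word.lower() for word in extra_words})

# Tiny made-up corpus with a low threshold so the effect is visible.
sample_docs = ["a b", "a c", "a b b"]
vocabulary = build_vocab(sample_docs, min_doc_freq=2, extra_words=["the"])
```

The `extra_words` argument plays the role of the supplementary list of frequent English words: those are kept regardless of how often they appear in the sample.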
Let’s kick it off!
python online_lda.py dataset.txt vocabulary.txt
The training took ~12 hours for a 100-topic model on my MacBook. The values of $\gamma$ and $\lambda$ are output to files every 10,000 iterations and when the training completes. We can then use one of the $\lambda$ files to see the top 20 words for the top N topics. For instance, to print the top 2 topics with the final model, use:
python printtopics.py vocabulary.txt lambda-final.dat 2
topic 0
suspended:0.8356 authorization:0.0215 entityset:0.0128 treemap:0.0094 professionals:0.0086 best:0.0084 facts:0.0072 special:0.0062 syntax:0.0056 listing:0.0051 forwarding:0.0049 webparts:0.0047 duration:0.0045 valued:0.0039 halts:0.0038 baggage:0.0034 yeah:0.0034 ltaspdropdownlistgt:0.0033 twitter:0.0031 liable:0.0030
topic 1
support:0.7800 quarter:0.0380 fig:0.0278 luck:0.0160 1gb:0.0142 funeral:0.0124 visiting:0.0109 xiv:0.0071 screen:0.0063 commons:0.0046 monster:0.0040 flash:0.0039 faculty:0.0037 desire:0.0031 detached:0.0030 handler:0.0028 say:0.0028 everyday:0.0025 darker:0.0025 screen:0.0024
The numbers are the word-topic probabilities from $\lambda$.
We’ll use the approach from the first LDA post to create word clouds for two different topics:
We managed to find topics on a dataset of 8 million Stack Overflow posts by using Online LDA. Feel free to download the code and try it out on other datasets!
Credits & Links