How does a search engine rank the results that it returns?
How does a question and answering system select a final answer?
How does a translation system ultimately choose a translation for some text?
In the next few posts, we will gain insight into these questions by exploring a sub-domain of machine learning called Learning to Rank. Each of the three questions has a common last step, where an algorithm takes a list of ‘candidates’ and produces a ranked list of results.
In the translation case, we give a system of string of text to translate, and it produces, say, 20 potential candidate translations. Then a Learning to Rank model steps in, ranks those 20 candidates, and returns the top candidate.
In the question-answering case, we give a system a query string, and it produces, say, 100 potential answers. Again we use a Learning to Rank model to produce a ranking of the potential answers, and select the top one as the final answer.
In the search engine case, we type in a search query, and the search engine produces, say, 1000 candidate search results. The order of the final results that you eventually see are produced by a Learning to Rank model that ranks those 1000 candidate results.
To explore the details, we’ll first define the problem addressed in the Learning to Rank domain and survey the methods, datasets, and metrics that are used. Then in the next posts, we’ll look in depth at a state of the art model called LambdaMART, and devise a way of visualizing its training process and the resulting model. Along the way we’ll also develop wrapper code for running a major learning to rank library called JForests and discuss other problems tackled by the field.
Learning to Rank
As a first step, we’ll define the problem. While learning to rank has numerous applications including machine translation, QA, and even spell checking, we’ll use search engine document ranking as a running example, mainly due to availability of datasets.
In the context of document retrieval, consider a search engine that accepts a query, and produces a ranking of documents related to the query using a technique such as TF-IDF scoring. Suppose we take the top 100 documents. The goal of learning to rank is to produce a better ranking of these 100 candidate documents for the given query.
Most research has posed learning to rank as a supervised learning problem, although there has been work in semi-supervised ranking. We’ll only focus on the supervised problem here. Given a list of queries and candidate documents with annotated relevance scores, the learning to rank model learns to predict the relevance of candidate documents for a new query, and derives a ranking from the predictions. Let’s define that a bit more precisely.
More precise problem definition
The data consists of queries, documents, and relevance scores. Let denote the set of queries, denote the set of documents, and denote the set of relevance scores.
For each query we have a list of candidate documents . We want to produce a ranking for the documents , specifically a permutation of the document indices . To do so, we want to learn a function that produces an optimal permutation .
As we’ll see, this ranking is produced in different ways; for instance, some models derive the rankings by producing a score for each pair, while others learn to tell whether the rank of is higher than .
In the standard training set, each document is assigned a relevance score that says how relevant the document is to query . These scores can be assigned by human annotators, or derived implicitly from sources such as click through data.
We can think of the training set as consisting of 3-tuples ; one training example consists of a query, a candidate document, and a measure of how relevant that document is to the query. An example labeling is , where 1 means “not relevant” and 5 means “very relevant”.
For instance, we could have a query = “Leo Tolstoy famous novel”, with candidate documents , and relevance labels .
Each pair is actually a feature vector , so the final training set is defined as for , where is the number of queries, and is the number of candidate documents for each query.
A concrete training set
To make this concrete, here’s a few training examples taken from the OHSUMED dataset:
0 qid:1 1:1.000000 2:1.000000 3:0.833333 4:0.871264 5:0 6:0 7:0 8:0.941842 9:1.000000 10:1.000000 11:1.000000 12:1.000000 13:1.000000 14:1.000000 15:1.000000 16:1.000000 17:1.000000 18:0.719697 19:0.729351 20:0 21:0 22:0 23:0.811565 24:1.000000 25:0.972730 26:1.000000 27:1.000000 28:0.922374 29:0.946654 30:0.938888 31:1.000000 32:1.000000 33:0.711276 34:0.722202 35:0 36:0 37:0 38:0.798002 39:1.000000 40:1.000000 41:1.000000 42:1.000000 43:0.959134 44:0.963919 45:0.971425 #docid = 244338 2 qid:1 1:0.600000 2:0.600000 3:1.000000 4:1.000000 5:0 6:0 7:0 8:1.000000 9:0.624834 10:0.767301 11:0.816099 12:0.934805 13:0.649685 14:0.680222 15:0.686762 16:0.421053 17:0.680904 18:1.000000 19:1.000000 20:0 21:0 22:0 23:1.000000 24:0.401391 25:0.938966 26:0.949446 27:0.984769 28:0.955266 29:1.000000 30:0.997786 31:0.441860 32:0.687033 33:1.000000 34:1.000000 35:0 36:0 37:0 38:1.000000 39:0.425450 40:0.975968 41:0.928785 42:0.978524 43:0.979553 44:1.000000 45:1.000000 #docid = 143821 0 qid:1 1:0.400000 2:0.400000 3:0.555555 4:0.563658 5:0 6:0 7:0 8:0.545844 9:0.380576 10:0.427356 11:0.468244 12:0.756579 13:0.366316 14:0.360838 15:0.373909 16:0.210526 17:0.479859 18:0.595237 19:0.608701 20:0 21:0 22:0 23:0.613865 24:0.184562 25:0.791539 26:0.863833 27:0.957024 28:0.896468 29:0.941132 30:0.946305 31:0.232558 32:0.507810 33:0.603068 34:0.616847 35:0 36:0 37:0 38:0.614004 39:0.202374 40:0.812801 41:0.868091 42:0.958879 43:0.926045 44:0.944576 45:0.963753 #docid = 285257 ... 0 qid:63 1:0.142857 2:0.162076 3:0.250000 4:0.250000 5:0 6:0 7:0 8:0.167856 9:0.078385 10:0.103707 11:0.132891 12:0.419191 13:0.027399 14:0.027300 15:0.027300 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.026316 32:0.063236 33:0.444444 34:0.407079 35:0 36:0 37:0 38:0.189973 39:0.011360 40:0.068814 41:0.066431 42:0.287984 43:0.188175 44:0.000000 45:0.046177 #docid = 173315 2 qid:63 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0 6:0 7:0 8:0.000000 9:0.000000 10:0.000000 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:0.000000 17:0.000000 18:0.000000 19:0.000000 20:0 21:0 22:0 23:0.000000 24:0.000000 25:0.000000 26:0.000000 27:0.000000 28:0.000000 29:0.000000 30:0.000000 31:0.000000 32:0.000000 33:0.000000 34:0.000000 35:0 36:0 37:0 38:0.000000 39:0.000000 40:0.000000 41:0.000000 42:0.000000 43:0.000000 44:0.000000 45:0.000000 #docid = 9897
We can see that the relevance scores (the first number of each line) are between 0 and 2. Looking at the original OHSUMED text, we see that the first example is with:
: “Are there adverse effects on lipids when progesterone is given with estrogen replacement therapy”
= #docid 244338 = “Effects on bone of surgical menopause and estrogen therapy with or without progesterone replacement in cynomolgus monkeys.”
= 0 = not relevant
and the second example is with:
= #docid 143821 = “Cardiovascular benefits of estrogen replacement therapy.”
= 2 = very relevant
The text has been converted into the features described on pg.11 of this paper. Note that I only showed the document title here and omitted the body text.
At test time, we receive a query and a list of candidate documents . We evaluate and return a ranking of the documents , specifically a permutation of the document indices .
Learning to Rank Datasets
There are several benchmark datasets for Learning to Rank that can be used to evaluate models.
Microsoft Research released the LETOR 3.0 and LETOR 4.0 datasets. LETOR 3.0 contains data from a 2002 crawl of .gov web pages and associated queries, as well as medical search queries and medical journal documents from the OHSUMED dataset. LETOR 4.0 contains queries from the Million Query Track from TREC and document ids from Gov2, another crawl of .gov web pages. Data is represented as raw feature vectors; information about the features used and the datasets is found here. One thing that I like about the OHSUMED data is that the original text of the queries and documents is available, which can be helpful for interpreting model output or deriving new features. Unfortunately obtaining the original GOV data is nontrivial.
Microsoft Research also released a larger scale dataset consisting 30,000 queries, candidate features, and relevance judgements derived from the Bing search engine. The MSLR-WEB30K dataset contains all of the queries, while a smaller MSLR-WEB10K dataset contains a 10,000 query sample of MSLR-WEB30K. Unfortunately, we are unable to retrieve the original query text and URLs to further interpret results, but the dataset is nevertheless very useful for evaluating a model and producing metrics.
Yahoo sponsored a Learning to Rank competition in 2010 and released a dataset consisting of 36,000 queries and 883,000 document IDs. The datasets are available here, but require a university email to download. Similar to MSLR-WEB, we receive the raw features and relevance scores, and are unable to see the text of the query or URL.
Learning to Rank Evaluation Metrics
When we run a learning to rank model on a test set to predict rankings, we evaluate the performance using metrics that compare the predicted rankings to the annotated gold-standard labels. Before reviewing the popular learning to rank metrics, let’s introduce notation.
We’ll assume that there are n queries, and that for each query we have a list of m candidates with annotated relevance scores . The ranker produces a ranking , so that represents the predicted ranking of document .
Intuitively, we would like our ranking to have high relevance scores, with the highest scores receiving the highest rankings. Now we’ll look at several metrics that capture these intuitions.
Normalized Discounted Cumulative Gain (NDCG)
Normalized Discounted Cumulative Gain (NDCG) is a popular learning to rank metric that relies on Cumulative Gain and Discounted Cumulative Gain.
First we’ll look at Cumulative Gain. Suppose we have:
That is, the top ranked document has a relevance score of 5, the second ranked document has a relevance score of 2, and so on.
The cumulative gain is simply the sum of the relevance scores:
Notice that this doesn’t take ranking order into account; if our ranking was reversed:
it would receive the same cumulative gain score:
Discounted Cumulative Gain
Discounted Cumulative Gain (DCG) takes order into account by ‘rewarding’ high relevance scores that appear in high ranks. A high relevance score appearing at a low rank receives a lower ‘reward’. To compute the metric, we use:
We can see that in position 1, the high relevance score 5 is worth 31, while in position 3 it’s worth 15.5. The lower relevance score 2 is only worth 1.9 in position 2. With DCG order matters; the score would increase if we swapped the rankings of and .
Normalized Discounted Cumulative Gain
is just a normalized version of . The normalization is done by finding the of an ‘ideal’ ranking. In our example, the ideal ranking would be:
We can see that if our predicted rankings were ideal, they would receive an score of .
Mean Average Precision (MAP)
Mean average precision (MAP) is another popular metric for learning to rank, specifically for applications with binary relevance scores, e.g. “correct” and “incorrect” in question-answering.
MAP relies on Precision at k (P@k), which involves counting the number of relevant (or “correct”) documents in the first k ranked positions, and dividing by k:
Average precision is then defined as:
Notice that is computed for a single query. is the mean value over all queries:
Learning to Rank Models
Learning to Rank models can be broadly classified into three groups: scorewise, pairwise, and listwise. We’ll introduce and provide an example of each group. There are many more learning to rank models than the ones covered here.
Pointwise models, also known as ‘scorewise’ models, create rankings by computing a real-valued score for each example , then sort by the score.
An advantage of pointwise models is that they have an interpretable confidence score for each query-document pair. Possible disadvantages are that there is not a notion of grouping the training instances (e.g. by query), and that the prediction target is exact relevance rather than relative position.
Pairwise models learn to decide which of two documents is ranked higher. The models rely on the observation that in learning to rank problems we ultimately care about the relative position of documents rather than exact distances between document scores.
Pairwise models first isolate the documents for each query, then form a training set consisting of pairs of documents, and a label for whether the first document has a higher rank than the second document. So each training example is of the form . Model-wise, we find such that when .
The problem is transformed into a binary classification problem; we train a binary classifier to minimize the number of incorrectly ordered pairs.
Listwise models are trained by optimizing a loss function that measures predicted ranks versus actual ranks for a query’s entire candidate list. A loss function such as MAP or NDCG can be directly optimized by listwise models. Examples include coordinate ascent and LambdaRank.
We’ve introduced the problem of Learning to Rank, and briefly surveyed its algorithms. In the next post, we’ll examine the LambdaMART model, and develop a way of visualizing its training process.
Sources and Further Reading
“Tree Ensembles for Learning to Rank”, Yasser Ganjisaffar 2011 PhD Thesis