A Dual Embedding Space Model for Document Ranking

Experiments

We compare the retrieval performance of DESM against BM25, a traditional count-based method, and Latent Semantic Analysis (LSA), a traditional vector-based method. We conduct our evaluations on two different test sets (explicit and implicit relevance judgements) and under two different experimental conditions (a large collection of documents and a telescoped subset).

Datasets

All the datasets used in this study are sampled from Bing’s large-scale query logs. The body text for all the candidate documents is extracted from Bing’s document index.

Explicitly judged test set

This evaluation set consists of 7,741 queries randomly sampled from Bing’s query logs from the period of October 2014 to December 2014. For each sampled query, a set of candidate documents is constructed by retrieving the top results from Bing over multiple scrapes during a period of a few months. In total, the final evaluation set contains 171,302 unique documents across all queries, each judged by human evaluators on a five-point relevance scale (Perfect, Excellent, Good, Fair, and Bad).

Implicit feedback based test set

This dataset is sampled from the Bing logs from the period of September 22, 2014 to September 28, 2014. It consists of the search queries submitted by users and the corresponding documents returned by the search engine in response. Each document is associated with a binary relevance judgment based on whether it was clicked by the user. This test set contains 7,477 queries and 42,573 distinct documents.

Implicit feedback based training set

This dataset is sampled in exactly the same way as the previous test set, but from the period of September 15, 2014 to September 21, 2014, and has 7,429 queries and 42,253 distinct documents. This set is used for tuning the parameters of the BM25 baseline and the mixture model.

Experiment Setup

We perform two distinct sets of evaluations for all the experimental and baseline models. In the first experiment, we consider all documents retrieved by Bing (from the online scrapes in the case of the explicitly judged set, or as recorded in the search logs in the case of the implicit feedback based sets) as the candidate set of documents to be re-ranked for each query. The fact that each of these documents was retrieved by the search engine implies that they are all at least marginally relevant to the query. Therefore, this experimental design isolates performance at the top ranks. As mentioned in Section [sec:aboutness], there is a parallel between this experimental setup and the telescoping evaluation strategy, which has been used often in recent literature (e.g., ). Note that having a strong retrieval model, in the form of the Bing search engine, for first-stage retrieval gives us a high-confidence candidate set and in turn ensures a reliable comparison with the baseline BM25 feature.

In our non-telescoped experiment, we consider every distinct document in the test set as a candidate for every query in the same dataset. This setup is more in line with traditional IR evaluation methodologies, where the model needs to retrieve the most relevant documents from a single large document collection. Our empirical results in Section 12 will show that the DESM model is a strong re-ranking signal but, as a standalone ranker, is prone to false positives. Yet, when we mix our neural model (DESM) with a count-based model (BM25), good performance is achieved.
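The mixture of the neural and count-based models can be sketched as a simple linear interpolation of the two scores. This is a minimal illustration, not the paper's implementation; the mixing weight `alpha` and the function name are hypothetical, with the weight tuned on the implicit feedback based training set.

```python
def mixture_score(bm25_score, desm_score, alpha=0.1):
    """Linear mixture of a count-based feature (BM25) and a neural
    feature (DESM). `alpha` is a hypothetical mixing weight that would
    be swept on a held-out training set; the scores are assumed to be
    on comparable scales (e.g. normalized per query)."""
    return alpha * desm_score + (1 - alpha) * bm25_score
```

With `alpha = 0`, ranking reduces to pure BM25; with `alpha = 1`, to pure DESM.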

For all the experiments we report the normalized discounted cumulative gain (NDCG) at different rank positions as a measure of performance for the different models under study.
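NDCG, the evaluation metric used throughout, discounts graded relevance gains by rank position and normalizes by the ideal ordering. A minimal sketch (with a linear gain and the standard log2 discount; other gain variants exist):

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i (0-based) is
    # discounted by log2(i + 2), i.e. log2(rank + 1) for 1-based ranks.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(relevances, k):
    """relevances: graded judgments in ranked order (e.g. 0-4 for
    Bad..Perfect). Returns NDCG@k, or 0.0 if the ideal DCG is zero."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A ranking already in ideal order scores 1.0; placing relevant documents lower reduces the score.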

Baseline models

We compare the DESM models to a term-matching baseline, BM25, and a vector space model baseline, Latent Semantic Analysis (LSA). For the BM25 baseline we use the values of 1.7 for the $`k_1`$ parameter and 0.95 for the $`b`$ parameter, based on a parameter sweep on the implicit feedback based training set. The LSA model is trained on the body text of 366,470 randomly sampled documents from Bing’s index, with a vocabulary size of 480,608 words. Note that unlike the word2vec models that train on word co-occurrence data, the LSA model by default trains on a word-document matrix.
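The per-term BM25 contribution with the tuned parameters above can be sketched as follows. This is an illustrative sketch, not the production ranker: the idf variant shown is one common formulation, and the exact form used in the baseline may differ.

```python
import math

def bm25_term(tf, df, doc_len, avg_doc_len, num_docs, k1=1.7, b=0.95):
    """Single query term's BM25 contribution. k1 controls term-frequency
    saturation and b controls document-length normalization; 1.7 and
    0.95 are the values tuned on the implicit feedback training set.
    The idf shown is a common smoothed variant (an assumption)."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

The document score is the sum of this contribution over the query terms; the saturating TF component is the part motivated by the 2-Poisson model discussed later.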

Introduction

A two dimensional PCA projection of the 200-dimensional embeddings. Relevant documents are yellow, irrelevant documents are grey, and the query is blue. To visualize the results of multiple queries at once, before dimensionality reduction we centre query vectors at the origin and represent documents as the difference between the document vector and its query vector. (a) uses IN word vector centroids to represent both the query and the documents. (b) uses IN for the queries and OUT for the documents, and seems to have a higher density of relevant documents near the query.

Identifying relevant documents for a given query is a core challenge for Web search. For large-scale search engines, it is possible to identify a very small set of pages that can answer a good proportion of queries . For such popular pages, clicks and hyperlinks may provide sufficient ranking evidence and it may not be important to match the query against the body text. However, in many Web search scenarios such query-content matching is crucial. New or recently updated documents may not yet have click evidence, or may have evidence that is out of date. For new or tail queries, there may be no memorized connections between the queries and the documents. Furthermore, many search engines and apps have a relatively small number of users, which limits their ability to answer queries based on memorized clicks. There may even be insufficient behaviour data to learn a click-based embedding or a translation model . In these cases it is crucial to model the relationship between the query and the document content, without click data.

When considering the relevance of document body text to a query, the traditional approach is to count repetitions of query terms in the document. Different transformation and weighting schemes for those counts lead to a variety of possible TF-IDF ranking features. One theoretical basis for such features is the probabilistic model of information retrieval, which has yielded the very successful TF-IDF formulation, BM25. As noted by , the probabilistic approach can be restricted to consider only the original query terms, or it can automatically identify additional terms that are correlated with relevance. However, the basic, commonly used form of BM25 considers query terms only, under the assumption that non-query terms are less useful for document ranking.

In the probabilistic approach, the 2-Poisson model forms the basis for counting term frequency . The stated goal is to distinguish between a document that is about a term and a document that merely mentions that term. These two types of documents have term frequencies from two different Poisson distributions, such that documents about the term tend to have higher term frequency than those that merely mention it. This explanation for the relationship between term frequency and aboutness is the basis for the TF function in BM25 .
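The 2-Poisson intuition can be illustrated with a toy simulation: term frequencies for documents *about* a term are drawn from a Poisson with a higher rate than for documents that merely *mention* it, so the two populations separate. The rates (8 vs. 1) are illustrative only, not values from the paper.

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's algorithm for sampling a Poisson-distributed variate
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

# Documents "about" the term vs. documents merely mentioning it:
# the same term, two different Poisson rates (hypothetical values).
about = [poisson(8.0) for _ in range(1000)]
mention = [poisson(1.0) for _ in range(1000)]
print(sum(about) / 1000, sum(mention) / 1000)  # "about" mean is clearly higher
```

This gap between the two distributions is what the saturating TF function in BM25 is designed to exploit.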

The new approach in this paper uses word occurrences as evidence of aboutness, as in the probabilistic approach. However, instead of considering term repetition as evidence of aboutness it considers the relationship between the query terms and all the terms in the document. For example, given a query term “yale”, in addition to considering the number of times Yale is mentioned in the document, we look at whether related terms occur in the document, such as “faculty” and “alumni”. Similarly, in a document about the Seahawks sports team one may expect to see the terms “highlights” and “jerseys”. The occurrence of these related terms in sufficient numbers is a way to distinguish between documents that merely mention Yale or Seahawks and the documents that are about the university or about the sports team.

With this motivation, in Section 9 we describe how the input and output embedding spaces learned jointly by a word2vec model may be particularly attractive for modelling the aboutness aspect of document ranking. Table [tbl:results-nearestneighbors] gives some anecdotal evidence of why this is true. If we look in the neighbourhood of the IN vector of the word “yale”, then the other IN vectors that are close correspond to words that are functionally similar or of the same type, e.g., “harvard” and “nyu”. A similar pattern emerges if we look at the OUT vectors in the neighbourhood of the OUT vector of “yale”. On the other hand, if we look at the OUT vectors that are closest to the IN vector of “yale”, we find words like “faculty” and “alumni”. We use this property of the IN-OUT embeddings to propose a novel Dual Embedding Space Model (DESM) for document ranking. Figure 1 further illustrates how in this Dual Embedding Space Model, using the IN embeddings for the query words and the OUT embeddings for the document words, we get a much more useful similarity definition between the query and the relevant document centroids.

The main contributions of this paper are:

  • We propose a novel Dual Embedding Space Model, with one embedding for query words and a separate embedding for document words, learned jointly from an unlabelled text corpus.

  • We propose a document ranking feature based on comparing all the query words with all the document words, which is equivalent to comparing each query word to a centroid of the document word embeddings.

  • We analyse the positive aspects of the new feature, which prefers documents that contain many words related to the query words, but also note the feature’s potential for false positive matches.

  • We empirically compare the new approach to a single embedding and the traditional word counting features. The new approach works well on its own in a telescoping setting, re-ranking the top documents returned by a commercial Web search engine, and in combination with word counting for a more general document retrieval task.
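The ranking feature in the second contribution can be sketched as follows: because cosine similarity is linear in the (unit-normalised) document vectors, comparing a query word with every document word reduces to a single comparison with the centroid of the document's unit OUT vectors. The function names and toy vectors below are illustrative, not the paper's implementation.

```python
import math

def _unit(v):
    # Normalise a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def desm(query_in_vecs, doc_out_vecs):
    """Average over query words of the cosine between the query word's
    IN vector and the centroid of the document's unit-normalised OUT
    vectors. Summing cosines over all document words is equivalent
    (up to normalisation) to this single centroid comparison."""
    units = [_unit(d) for d in doc_out_vecs]
    centroid = _unit([sum(c) / len(units) for c in zip(*units)])
    return sum(
        sum(qi * ci for qi, ci in zip(_unit(q), centroid))
        for q in query_in_vecs
    ) / len(query_in_vecs)
```

A document whose OUT centroid points in the same direction as the query's IN vectors scores near 1; unrelated directions score near 0, which is why sufficient numbers of related terms (e.g., “faculty”, “alumni” for “yale”) raise the score.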