A Short Note on Proximity-based Scoring of Documents with Multiple Fields
The BM25 ranking function is one of the best-known functions for scoring documents by their relevance to a query, and many variations of it have been proposed. The BM25F function is an adaptation designed for modeling documents with multiple fields. The Expanded Span method extends a BM25-like function by taking into account the proximity between term occurrences. In this note, we combine these two variations into a single scoring method for proximity-based scoring of documents with multiple fields.
💡 Research Summary
The paper proposes a novel document scoring function that integrates two well‑known extensions of the BM25 probabilistic retrieval model: BM25F, which handles multiple weighted fields, and the Expanded Span method, which incorporates term‑proximity information. The authors first review the standard BM25 formula, highlighting its parameters k₁ (term‑frequency saturation) and b (document‑length normalization). They then describe BM25F, which treats a document as a set of named text fields (e.g., title, body, price) and assigns each field a boost weight (boost_f) and a field‑specific length‑normalization factor (b_f).
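As a rough illustration of the two formulas reviewed above, the following sketch scores a single query term with a common textbook form of BM25 and of BM25F. The function names, argument names, and default values for k₁ and b are assumptions made for this example, and the paper's exact variant (for instance its IDF definition) may differ.

```python
def bm25_term(tf, doc_len, avg_len, idf, k1=1.2, b=0.75):
    """Score of one term under a common BM25 form (assumed, not the paper's exact one)."""
    # Length normalization: equals 1.0 when the document has average length.
    norm = 1.0 - b + b * doc_len / avg_len
    # k1 controls how quickly repeated occurrences saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


def bm25f_term(field_tfs, field_lens, avg_field_lens, boosts, b_f, idf, k1=1.2):
    """Score of one term under a common BM25F form (assumed).

    All dict arguments are keyed by field name, e.g. "title" or "body".
    """
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        # Each field has its own length normalization factor b_f.
        norm = 1.0 - b_f[field] + b_f[field] * field_lens[field] / avg_field_lens[field]
        # Field boosts weight the normalized per-field frequencies.
        pseudo_tf += boosts[field] * tf / norm
    # Saturation is applied once, to the combined pseudo-frequency.
    return idf * pseudo_tf / (k1 + pseudo_tf)
```

In this form the per-field frequencies are combined before saturation, so a term that appears in a heavily boosted field such as a title contributes more than the same occurrence in the body.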
Next, the Expanded Span approach is explained. Documents are modeled as ordered arrays of term positions; for a given query, the method extracts “spans” – ordered chains of query‑term occurrences that satisfy three constraints: (1) the terms appear in the same order as in the query, (2) spans never overlap, and (3) the distance between consecutive terms in a span does not exceed a configurable window size M. For each term t, a span‑based term frequency rc(t,D) is computed as the sum over all spans s of in(t,s)·len(s)^z / width(s)^x, where in(t,s) indicates whether t occurs in s, len(s) is the number of positions in the span, width(s) is the distance between the first and last positions (or 1/M if the span length is one), and z and x control the influence of term frequency within a span and the penalty for span width, respectively. The rc value replaces the ordinary tf in the BM25 scoring equation, thereby injecting proximity information.
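The span-based frequency described above can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the segmentation is a greedy pass that enforces only the window constraint M and non-overlap, and it omits the query-order constraint. The names `extract_spans`, `span_tf`, and `single_width` are hypothetical, and the width used for a one-position span is passed in explicitly rather than hard-coded.

```python
from collections import defaultdict


def extract_spans(hits, M):
    """Split query-term hits into non-overlapping spans.

    hits: list of (position, term) pairs sorted by position, query terms only.
    A new span starts whenever the gap to the previous hit exceeds M.
    Simplification: the query-order constraint is not checked here.
    """
    spans, current = [], []
    for pos, term in hits:
        if current and pos - current[-1][0] > M:
            spans.append(current)
            current = []
        current.append((pos, term))
    if current:
        spans.append(current)
    return spans


def span_tf(hits, M, z, x, single_width):
    """Compute rc(t, D) = sum over spans s of in(t, s) * len(s)**z / width(s)**x."""
    rc = defaultdict(float)
    for span in extract_spans(hits, M):
        length = len(span)
        # Width is last position minus first; one-position spans use single_width.
        width = span[-1][0] - span[0][0] if length > 1 else single_width
        weight = length ** z / width ** x
        # in(t, s) is 1 for every distinct term present in the span.
        for term in {t for _, t in span}:
            rc[term] += weight
    return dict(rc)
```

The resulting rc values would then be used wherever the ordinary tf appears in the BM25 equation.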
The core contribution of the paper is the seamless combination of these two ideas. For a document D consisting of K fields, the authors define a field‑specific span‑based frequency rc(t,f,D) = Σ_{s∈D