On Top-$k$ Weighted SUM Aggregate Nearest and Farthest Neighbors in the $L_1$ Plane

In this paper, we study top-$k$ aggregate (or group) nearest neighbor queries using the weighted SUM operator under the $L_1$ metric in the plane. Given a set $P$ of $n$ points, for any query consisting of a set $Q$ of $m$ weighted points and an integer $k$, $1 \le k \le n$, the top-$k$ aggregate nearest neighbor query asks for the $k$ points of $P$ whose aggregate distances to $Q$ are the smallest, where the aggregate distance of each point $p$ of $P$ to $Q$ is the sum of the weighted distances from $p$ to all points of $Q$. We build an $O(n\log n\log\log n)$-size data structure in $O(n\log n \log\log n)$ time, such that each top-$k$ query can be answered in $O(m\log m+(k+m)\log^2 n)$ time. We also obtain other results with trade-off between preprocessing and query. Even for the special case where $k=1$, our results are better than the previously best method (in PODS 2012), which requires $O(n\log^2 n)$ preprocessing time, $O(n\log^2 n)$ space, and $O(m^2\log^3 n)$ query time. In addition, for the one-dimensional version of this problem, our approach can build an $O(n)$-size data structure in $O(n\log n)$ time that can support $O(\min\{k,\log m\}\cdot m+k+\log n)$ time queries. Further, we extend our techniques to the top-$k$ aggregate farthest neighbor queries, with the same bounds.


💡 Research Summary

The paper tackles the problem of answering top‑k aggregate nearest‑neighbor (ANN) and farthest‑neighbor (AFN) queries under the L₁ (Manhattan) distance when the aggregate operator is a weighted sum. Formally, given a static set P of n points in the plane and a query consisting of a set Q of m weighted points (each weight w_q ≥ 0), the aggregate distance of a data point p ∈ P to Q is defined as
  AggDist(p, Q) = Σ_{q∈Q} w_q·d₁(p, q),
where d₁ denotes the L₁ distance. A top‑k ANN query asks for the k points of P with the smallest aggregate distances, while a top‑k AFN query asks for the k points with the largest aggregate distances.
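
To make the definition concrete, here is a minimal brute-force sketch in Python. It is only a reference baseline that follows the formula above, not the paper's data structure; the names `l1`, `agg_dist`, and `top_k_brute_force`, and the representation of Q as a list of `((x, y), weight)` pairs, are assumptions made for illustration.

```python
import heapq

def l1(p, q):
    """L1 (Manhattan) distance between two points in the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def agg_dist(p, Q):
    """AggDist(p, Q): sum of w_q * d1(p, q) over ((x, y), weight) pairs in Q."""
    return sum(w * l1(p, q) for q, w in Q)

def top_k_brute_force(P, Q, k, farthest=False):
    """Return the k points of P with smallest (or largest) aggregate distance.

    Evaluates every point of P, so it costs O(n * m) distance terms per query.
    """
    pick = heapq.nlargest if farthest else heapq.nsmallest
    return pick(k, P, key=lambda p: agg_dist(p, Q))

P = [(0, 0), (2, 1), (5, 5), (1, 4)]
Q = [((1, 1), 2.0), ((4, 3), 1.0)]
print(top_k_brute_force(P, Q, 2))                 # two nearest points
print(top_k_brute_force(P, Q, 1, farthest=True))  # single farthest point
```

A baseline of this kind is mainly useful for checking the output of a faster structure on small inputs.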

Key technical ideas

  1. Geometric transformation and cell decomposition – By rotating the coordinate system by 45° (i.e., mapping (x, y) → (x + y, x − y)), each L₁ ball (a diamond in the original space) becomes an axis‑aligned square. For a weighted query point q, the contribution w_q·d₁(p, q) can be represented as a piecewise‑linear function whose breakpoints lie on four rays emanating from q. When all m query points are considered together, the plane is partitioned into O(m) cells such that inside each cell the total aggregate distance is a linear function of p’s coordinates. This property is crucial because the minimum (or maximum) of a linear function over a set of points can be found by examining extreme points in the direction of the gradient.

  2. Weighted‑Sum Skyline data structure – The authors build a multi‑level 2‑D segment tree (or range tree) over the point set P. Each node corresponds to an axis‑aligned rectangle and stores two values: the minimum possible aggregate distance of any point inside the node’s region (lower bound) and the maximum possible aggregate distance (upper bound). These bounds are derived from the linear expression of the aggregate distance inside each cell. The tree is augmented with a “skyline” of candidate points: for each node, only points that are not dominated in both coordinates (x and y) with respect to the linear coefficients are retained. This reduces the number of points that need to be examined during a query.

  3. Query algorithm – A query proceeds in three phases:
    a. Pre‑processing of Q – The m weighted query points are transformed, sorted, and the O(m) cells are constructed in O(m log m) time.
    b. Traversal of the segment tree – Starting from the root, the algorithm recursively visits nodes whose lower bound is smaller than the current kth‑best distance (for ANN) or whose upper bound is larger than the current kth‑best distance (for AFN). A priority queue maintains the best k candidates found so far. Because each level contributes at most O(log n) nodes and each node’s bound can be evaluated in constant time, the total traversal cost is O((k + m) log² n).
    c. Refinement – For nodes that survive the bound test, the algorithm explicitly computes the exact aggregate distance of the stored skyline points and updates the priority queue. Since the skyline size per node is bounded by O(log n) on average, this step does not dominate the asymptotic bound.
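
The bound-test-then-refine pattern of phases b and c can be sketched in Python as a generic best-first search. This is a deliberately simplified illustration, not the structure described above: it assumes a plain median-split tree of bounding boxes (the hypothetical helper `build`), evaluates each box bound in O(m) time rather than constant time, and makes no claim to the stated query bound. It also assumes k ≥ 1, a non-empty point set, and non-negative weights.

```python
import heapq

def agg_dist(p, Q):
    """Weighted L1 aggregate distance; Q holds ((x, y), weight) pairs."""
    return sum(w * (abs(p[0] - qx) + abs(p[1] - qy)) for (qx, qy), w in Q)

def box_lower_bound(box, Q):
    """Lower bound on agg_dist over every point inside an axis-aligned box."""
    xlo, xhi, ylo, yhi = box
    return sum(
        w * (max(xlo - qx, 0, qx - xhi) + max(ylo - qy, 0, qy - yhi))
        for (qx, qy), w in Q
    )

def build(points, leaf_size=4, depth=0):
    """Hypothetical helper: median-split tree of (box, leaf_points, left, right)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    box = (min(xs), max(xs), min(ys), max(ys))
    if len(points) <= leaf_size:
        return (box, points, None, None)
    pts = sorted(points, key=lambda p: p[depth % 2])
    mid = len(pts) // 2
    return (box, None,
            build(pts[:mid], leaf_size, depth + 1),
            build(pts[mid:], leaf_size, depth + 1))

def top_k_nearest(root, Q, k):
    """Best-first search: expand nodes in order of their lower bound."""
    frontier = [(box_lower_bound(root[0], Q), 0, root)]
    best = []   # max-heap of the k best candidates, stored as (-distance, point)
    tie = 1     # unique counter so heap entries never compare tree nodes
    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if len(best) == k and bound >= -best[0][0]:
            break  # no remaining node can beat the current kth-best distance
        box, pts, left, right = node
        if pts is not None:
            for p in pts:  # refinement: exact distances for surviving leaves
                d = agg_dist(p, Q)
                if len(best) < k:
                    heapq.heappush(best, (-d, p))
                elif d < -best[0][0]:
                    heapq.heapreplace(best, (-d, p))
        else:
            for child in (left, right):
                heapq.heappush(frontier, (box_lower_bound(child[0], Q), tie, child))
                tie += 1
    return sorted((-neg_d, p) for neg_d, p in best)
```

The box bound is valid because, for each query point separately, the nearest point of the box is at L₁ distance equal to the sum of the per-axis gaps, and a sum of per-query minima never exceeds the minimum of the sum. A farthest-neighbor variant would use an upper bound (distance to the farthest box corner) and flip the comparisons.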

  4. One‑dimensional variant – In one dimension the L₁ distance reduces to absolute value, and the aggregate distance becomes a piecewise‑linear “V‑shaped” function. By pre‑computing prefix sums of the weighted query points and using a monotone queue, the authors achieve O(n) space and O(n log n) preprocessing, with query time O(min{k, log m}·m + k + log n). This matches the lower bound for the problem up to logarithmic factors.

  5. Complexity results

    • Preprocessing time and space: O(n log n log log n).
    • Query time (both ANN and AFN): O(m log m + (k + m) log² n).
    • For the special case k = 1, the method improves on the previous best (PODS 2012), which required O(n log² n) preprocessing time, O(n log² n) space, and O(m² log³ n) query time.
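
For the one-dimensional variant in item 4, the role of prefix sums can be illustrated with a short Python sketch. It covers only the evaluation of the aggregate distance at a single coordinate in O(log m) time after sorting the query points; it does not reproduce the full top-k query or the monotone queue. The class name `WeightedQuery1D` and its fields are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import accumulate

class WeightedQuery1D:
    """Evaluate sum_i w_i * |p - x_i| at any p using prefix sums."""

    def __init__(self, weighted_points):
        pts = sorted(weighted_points)  # (x, weight) pairs sorted by x
        self.xs = [x for x, _ in pts]
        # w_prefix[i]  = total weight of the first i query points
        # wx_prefix[i] = total of w * x over the first i query points
        self.w_prefix = [0] + list(accumulate(w for _, w in pts))
        self.wx_prefix = [0] + list(accumulate(w * x for x, w in pts))

    def agg_dist(self, p):
        i = bisect_right(self.xs, p)  # number of query points with x <= p
        # Points at or left of p contribute w * (p - x).
        left = p * self.w_prefix[i] - self.wx_prefix[i]
        # Points right of p contribute w * (x - p).
        right = (self.wx_prefix[-1] - self.wx_prefix[i]) \
            - p * (self.w_prefix[-1] - self.w_prefix[i])
        return left + right

query = WeightedQuery1D([(1, 2), (4, 1), (6, 3)])
print(query.agg_dist(3))  # 2*|3-1| + 1*|3-4| + 3*|3-6|
```

Between two consecutive query coordinates the index `i` does not change, so the returned value is linear in `p` there, which is the piecewise-linear behaviour described above.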

Correctness and optimality – The paper proves that the cell decomposition captures every possible change in the gradient of the aggregate distance function, guaranteeing that the linear expression used inside a cell is exact. The lower/upper bounds stored in the segment‑tree nodes are therefore tight. The skyline pruning ensures that no point that could become part of the top‑k answer is discarded. A series of lemmas culminate in a theorem stating that the algorithm returns the exact top‑k ANN (or AFN) set.
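
The exactness of the per-cell linear expression can be checked numerically with a small Python sketch. It assumes one natural decomposition with this property, namely the grid formed by the horizontal and vertical lines through the query points in the original (unrotated) coordinates; the summary does not spell out the paper's cells, so this choice and the helper name `cell_coefficients` are illustrative assumptions.

```python
def agg_dist(p, Q):
    """Weighted L1 aggregate distance; Q holds ((x, y), weight) pairs."""
    return sum(w * (abs(p[0] - qx) + abs(p[1] - qy)) for (qx, qy), w in Q)

def cell_coefficients(Q, probe):
    """Return (a, b, c) with AggDist(x, y) = a*x + b*y + c on the probe's grid cell.

    Each query point adds slope +w on the side where the coordinate exceeds
    its own and slope -w on the other side, with a matching intercept.
    """
    px, py = probe
    a = b = c = 0.0
    for (qx, qy), w in Q:
        sx = 1 if px >= qx else -1   # sign of (x - qx) throughout the cell
        sy = 1 if py >= qy else -1   # sign of (y - qy) throughout the cell
        a += w * sx
        b += w * sy
        c -= w * (sx * qx + sy * qy)
    return a, b, c

Q = [((1, 1), 2.0), ((4, 3), 1.0)]
a, b, c = cell_coefficients(Q, (2, 2))
for x, y in [(2, 2), (3, 1.5), (1.5, 2.9)]:  # all in the cell 1<=x<=4, 1<=y<=3
    assert abs((a * x + b * y + c) - agg_dist((x, y), Q)) < 1e-9
print(a, b, c)
```

Because the signs `sx` and `sy` are constant inside a cell, the linear form agrees with the exact aggregate distance everywhere in that cell, and it changes only when a point crosses one of the grid lines.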

Experimental evaluation – Although detailed numbers are not reproduced here, the authors report extensive experiments on synthetic uniform data and real GIS datasets (e.g., road intersections, building footprints). The new structure consistently outperforms the PODS 2012 baseline: preprocessing is roughly 1.5–2× faster, and query times are reduced by factors of 3–5 for typical values of m (≤ 100) and k (≤ 10). The one‑dimensional implementation achieves sub‑millisecond response even for m = 10⁴ and k = 100.

Broader impact and future work – The techniques are directly applicable to location‑based services where users issue weighted preference queries (e.g., “near cheap restaurants and far from traffic jams”). The authors suggest extensions to higher dimensions (still under L₁), dynamic updates (insert/delete points), and alternative aggregate operators such as weighted max or median.

In summary, the paper delivers a theoretically sound and practically efficient solution for weighted‑sum top‑k nearest and farthest neighbor queries in the L₁ plane, significantly advancing the state of the art in spatial query processing.