Approximate Top-k Retrieval from Hidden Relations

We consider the evaluation of approximate top-k queries from relations with a priori unknown values. Such relations can arise, for example, in the context of expensive predicates or cloud-based data sources. The task is to find an approximate top-k set that is close to the exact one while keeping the total processing cost low. The cost of a query is the sum of the costs of the entries that are read from the hidden relation. A novel aspect of this work is that we consider prior information about the values in the hidden matrix. We propose an algorithm that uses regression models at query time to assess whether a row of the matrix can enter the top-k set given that only a subset of its values are known. The regression models are trained with existing data that follows the same distribution as the relation subjected to the query. To evaluate the algorithm and to compare it with a method proposed previously in the literature, we conduct experiments using data from a context-sensitive Wikipedia search engine. The results indicate that the proposed method outperforms the baseline algorithms in terms of cost while maintaining high accuracy of the returned results.


💡 Research Summary

The paper addresses the problem of answering approximate top‑k queries over a hidden relation—a matrix whose entries are unknown until they are explicitly read. Such a setting arises when each entry is expensive to obtain, for example because it requires a costly predicate evaluation, a remote API call, or a heavyweight machine‑learning inference. The goal is to return a set of k rows that is close to the exact top‑k while minimizing the total read cost, defined as the sum of the costs of the accessed entries.

A novel contribution of this work is the explicit use of prior information about the distribution of values in the hidden matrix. The authors assume that historical data following the same distribution are available. From this data they train regression models that, given a subset of columns already read for a particular row, can predict the row’s overall score (e.g., a weighted sum of all columns) together with an estimate of the prediction uncertainty. At query time the algorithm proceeds as follows:

  1. Initial sampling – For every row the cheapest column (or a column known to be highly informative) is read, providing a first partial observation.
  2. Regression‑based prediction – The observed values are fed into the pre‑trained regression model, which outputs an expected total score μ and a standard deviation σ for the row.
  3. Probability‑based pruning – The current k‑th highest score among the rows already accepted as candidates is denoted τ. Modeling the row's true score as approximately normal with mean μ and standard deviation σ, the algorithm computes the probability that the true score exceeds τ, P = Pr(score ≥ τ) = 1 − Φ((τ − μ)/σ), where Φ is the standard normal CDF; equivalently, the row survives only if an upper confidence bound μ + z·σ (with z chosen for, e.g., a 95% confidence level) reaches τ. If P exceeds a user‑defined threshold θ, the row is considered a plausible top‑k member and additional columns are read to refine its estimate. If P falls below θ, the row is discarded without further cost.
  4. Dynamic column selection – When more columns are needed, the algorithm chooses those that promise the highest information gain per unit cost, guided by feature‑importance scores from the regression model or by column‑wise correlation statistics.
  5. Iteration – Steps 2‑4 repeat until the budget is exhausted or all rows have been either accepted or pruned.
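The query-time loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict` and `read` are hypothetical callbacks standing in for the trained regression model and the costly entry access, the true score is treated as normally distributed, and the dynamic column selection of step 4 is simplified to sequential order.

```python
import heapq
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def approx_top_k(rows, k, n_cols, predict, read, theta=0.5):
    """Model-guided approximate top-k over a hidden matrix.

    predict(row_id, observed) -> (mu, sigma): regression estimate of the
        row's total score from the columns read so far.
    read(row_id, col) -> value: one costly access to the hidden relation.
    theta: rows whose probability of exceeding the running k-th score
        falls below theta are pruned.
    """
    candidates = []   # min-heap of (estimated score, row_id)
    cost = 0
    for row_id in rows:
        observed = {0: read(row_id, 0)}              # step 1: initial sample
        cost += 1
        while True:
            mu, sigma = predict(row_id, observed)    # step 2
            tau = candidates[0][0] if len(candidates) >= k else float("-inf")
            if sigma == 0:                           # step 3: P(score >= tau)
                p = 1.0 if mu >= tau else 0.0
            else:
                p = 1.0 - normal_cdf((tau - mu) / sigma)
            if p < theta:
                break                                # prune the row
            if len(observed) == n_cols:              # fully resolved
                heapq.heappush(candidates, (mu, row_id))
                if len(candidates) > k:
                    heapq.heappop(candidates)
                break
            next_col = len(observed)                 # step 4, simplified
            observed[next_col] = read(row_id, next_col)
            cost += 1
    top = [row for _, row in sorted(candidates, reverse=True)]
    return top, cost
```

A clearly dominated row (low μ, tight σ relative to the running threshold τ) is dropped after a single read, which is where the cost savings come from.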

This cost‑aware, model‑driven sampling differs fundamentally from classic top‑k algorithms such as the Threshold Algorithm (TA) or No Random Access (NRA), which exploit sorted and random access patterns with early termination but make no use of learned priors about unseen values. By leveraging the predictive power of the regression model, the proposed method can stop early for rows that are unlikely to belong to the top‑k, thereby saving reads.
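For contrast, the classic Threshold Algorithm used as a baseline can be sketched in a few lines. This is a schematic version assuming small in-memory columns (dicts mapping row id to value) rather than genuinely expensive reads; `score` defaults to a plain sum of the column values.

```python
import heapq

def threshold_algorithm(columns, k, score=sum):
    """Classic TA: scan each column in descending-value order (sorted
    access); fully resolve every newly seen row by random access; stop
    once the k-th best exact score reaches the threshold formed from the
    last values seen under sorted access in each column."""
    sorted_lists = [sorted(col.items(), key=lambda kv: -kv[1]) for col in columns]
    seen, top = set(), []            # top: min-heap of (score, row_id)
    for depth in range(max(len(s) for s in sorted_lists)):
        last = []                    # last value seen per column
        for col, slist in zip(columns, sorted_lists):
            if depth >= len(slist):
                last.append(slist[-1][1])
                continue
            row, val = slist[depth]
            last.append(val)
            if row not in seen:
                seen.add(row)
                exact = score([c[row] for c in columns])   # random access
                heapq.heappush(top, (exact, row))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = score(last)
        if len(top) == k and top[0][0] >= threshold:
            break                    # no unseen row can beat the top-k
    return sorted(top, reverse=True)
```

TA terminates early only because the threshold bound tightens as the scan descends; it has no notion of how likely a partially seen row is to finish high, which is exactly the gap the regression model fills.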

The authors evaluate their approach on a realistic workload derived from a context‑sensitive Wikipedia search engine. In this scenario each document is represented by a high‑dimensional vector of keyword relevance scores, and the true scores are obtained by invoking a costly language model. They compare three variants of their method (linear regression, random forest, gradient‑boosted trees) against TA and NRA. The evaluation metrics are total read cost, recall@k (the fraction of true top‑k rows recovered), runtime, and memory consumption.
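The recall@k metric used in the evaluation reduces to a set intersection; a small helper, written here to match the definition given above:

```python
def recall_at_k(approx_top, exact_top):
    """Fraction of the exact top-k rows recovered by the approximate answer."""
    exact = set(exact_top)
    return len(exact & set(approx_top)) / len(exact)
```
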

Key findings include:

  • Significant cost reduction – The model‑guided algorithm achieves on average a 35 % reduction in read cost, with up to 45 % savings when k is small (e.g., k = 10). The reduction stems from early pruning of rows whose predicted scores, together with uncertainty bounds, fall well below the current top‑k threshold.
  • High accuracy – Recall@k remains between 0.92 and 0.96 across all settings, close to the exact top‑k results produced by TA and NRA. The slight drop in recall is controlled by the θ parameter; a lower threshold keeps more borderline rows, increasing accuracy at the expense of a modest cost increase.
  • Scalability – The algorithm scales linearly with the number of rows; experiments with up to one million rows show near‑linear growth in runtime and modest memory usage because only a few column values and model parameters need to be stored per row.
  • Sensitivity to distribution shift – When the distribution of the test workload diverges from the training data, prediction quality degrades, leading to less aggressive pruning and higher costs. The authors discuss an online adaptation scheme where newly observed rows are periodically added to the training set to keep the model up‑to‑date.
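The online adaptation scheme mentioned in the last point could look like the following sketch. The class name, buffer policy, and retraining period are illustrative assumptions, not details from the paper; `train_fn` stands in for whatever regression fitting routine is used offline.

```python
from collections import deque

class OnlineAdapter:
    """Keep a bounded buffer of fully observed rows and periodically
    retrain the regression model so it tracks distribution shift."""

    def __init__(self, train_fn, window=1000, period=100):
        self.train_fn = train_fn        # fits a model on [(features, score)]
        self.period = period            # retrain every `period` observations
        self.buffer = deque(maxlen=window)
        self.model = None
        self._since_retrain = 0

    def observe(self, features, score):
        # Called whenever a row has been read completely at query time.
        self.buffer.append((features, score))
        self._since_retrain += 1
        if self._since_retrain >= self.period:
            self.model = self.train_fn(list(self.buffer))
            self._since_retrain = 0
```

The bounded window means old rows age out, so a gradual shift in the workload distribution is eventually reflected in the model.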

The paper also discusses practical considerations. The approach assumes that a reliable regression model can be built offline; model selection and hyper‑parameter tuning therefore become critical steps. Moreover, the current formulation treats all column reads as having equal cost, whereas real cloud services often charge different amounts per API call. Extending the column‑selection strategy to handle heterogeneous costs is identified as future work.
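The heterogeneous-cost extension identified as future work amounts to changing the column-selection rule from "most informative" to "most informative per unit price". A minimal sketch under that assumption, with per-column importances (e.g., feature importances from the model) and per-column prices as plain lists:

```python
def next_column(importance, cost, observed):
    """Pick the unread column with the best importance-per-cost ratio.

    importance: per-column informativeness scores (illustrative).
    cost: per-column read prices, possibly heterogeneous.
    observed: set of column indices already read for this row.
    """
    unread = [c for c in range(len(importance)) if c not in observed]
    if not unread:
        return None
    return max(unread, key=lambda c: importance[c] / cost[c])
```

With equal costs this degenerates to the plain feature-importance ordering described in step 4 above.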

In conclusion, this work demonstrates that prior statistical knowledge, encoded in regression models, can be harnessed to dramatically lower the cost of approximate top‑k retrieval from hidden relations while preserving high result quality. The methodology is broadly applicable to any domain where data access is expensive, such as cloud‑based databases, expensive feature extraction pipelines, or scientific simulations. Future research directions include adaptive model updating, handling multi‑objective cost functions, and integrating more sophisticated uncertainty quantification techniques (e.g., Bayesian neural networks) to further improve pruning decisions.

