Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information
Random sampling is an essential tool in the processing and transmission of data. It is used to summarize data too large to store or manipulate and to meet resource constraints on bandwidth or battery power. Estimators that are applied to the sample facilitate fast approximate processing of queries posed over the original data, and the value of the sample hinges on the quality of these estimators. Our work targets data sets such as request and traffic logs and sensor measurements, where data is repeatedly collected over multiple *instances*: time periods, locations, or snapshots. We are interested in queries that span multiple instances, such as distinct counts and distance measures over selected records. These queries are used for applications ranging from planning to anomaly and change detection. Unbiased low-variance estimators are particularly effective, as the relative error decreases with the number of selected record keys. The Horvitz-Thompson estimator, known to minimize variance for sampling with “all or nothing” outcomes (which reveal either the exact value or no information about the estimated quantity), is not optimal for multi-instance operations, where an outcome may provide partial information. We present a general principled methodology for the derivation of (Pareto) optimal unbiased estimators over sampled instances and aim to understand its potential. We demonstrate significant improvement in the accuracy of estimates of fundamental queries for common sampling schemes.
💡 Research Summary
The paper addresses the problem of estimating functions that depend on multiple “instances” of data—such as time periods, locations, or snapshots—when only a sample of each instance is available. Traditional approaches rely on the Horvitz‑Thompson (H‑T) estimator, which is optimal for “all‑or‑nothing” sampling: a value is either observed exactly or not observed at all, and the estimator returns zero when the value is missing and the inverse‑probability‑weighted value when it is observed. While H‑T minimizes variance among unbiased non‑negative estimators in that setting, it fails to exploit the partial information that often exists in multi‑instance scenarios. For example, if only one of two values is sampled, we still know that the maximum of the two is at least the sampled value, providing a useful lower bound.
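As a concrete illustration of the all‑or‑nothing scheme described above, the sketch below applies inverse‑probability weighting to a single value sampled with known probability p. The function name and the numbers are illustrative assumptions, not taken from the paper.

```python
import random

def ht_estimate(value, p, sampled):
    # Horvitz-Thompson: inverse-probability weight when observed, zero otherwise.
    return value / p if sampled else 0.0

# Unbiasedness check by simulation (illustrative numbers: v = 10, p = 0.3).
rng = random.Random(0)
v, p = 10.0, 0.3
estimates = [ht_estimate(v, p, rng.random() < p) for _ in range(100_000)]
mean = sum(estimates) / len(estimates)  # averages close to v
```

Averaging many independent runs recovers v, which is exactly the unbiasedness property; the price is high variance whenever p is small.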
The authors formalize a general sampling model for dispersed data. A data vector v = (v₁,…,vᵣ) represents the values of a single key across r instances. Sampling is described by a distribution T over predicates σ = (σ₁,…,σᵣ) that decide independently for each instance whether the entry is included, possibly depending on the value (weighted sampling) or not (weight‑oblivious sampling). They distinguish two informational regimes: known seeds, where the random hash values (or predicates) used in sampling are available to the estimator, and unknown seeds, where they are not.
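A minimal sketch of this model for a single key under weighted (PPS‑style) Poisson sampling with per‑entry seeds; the inclusion rule min(1, v/τ) and all names here are illustrative assumptions, not the paper's notation.

```python
import random

def poisson_sample(values, tau, rng):
    # Each entry draws an independent seed u_i; entry i is included
    # when u_i falls below the value-dependent probability min(1, v_i/tau).
    seeds = [rng.random() for _ in values]
    outcome = [v if u < min(1.0, v / tau) else None
               for v, u in zip(values, seeds)]
    return outcome, seeds

rng = random.Random(1)
values = [5.0, 0.5, 3.0]  # one key's values across r = 3 instances
outcome, seeds = poisson_sample(values, tau=4.0, rng=rng)
# In the known-seeds regime the estimator sees both `outcome` and `seeds`;
# in the unknown-seeds regime it sees only `outcome`.
```

Note that under this rule a missing entry with seed u implies v ≤ τ·u, which is precisely the kind of partial information the known‑seeds regime can exploit.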
The core methodological contribution is a principled way to construct Pareto‑optimal unbiased estimators that make use of any partial information present in the sample outcome. The authors introduce the notion of a feasible outcome set S* – the collection of sampling outcomes on which the estimator may return a positive value. For each outcome S ∈ S*, they define:
- f*(S) – the exact value of the target function f(v) that is forced by the outcome (or a tight lower/upper bound when the exact value cannot be recovered);
- p*(S) – the probability of observing outcome S under the sampling distribution.
The estimator then takes the form
f̂(S) = f*(S) / p*(S) for S ∈ S*, and f̂(S) = 0 otherwise.
Because S* can be enlarged to include outcomes that provide only bounds, the estimator can assign non‑zero values even when the full data is not observed, thereby reducing variance compared with the classic H‑T estimator that uses only the “all‑sampled” outcome.
A key insight is that the availability of the sampling seeds dramatically changes what partial information can be exploited. When seeds are known, the estimator can infer exact thresholds (e.g., “vᵢ < τᵢ(uᵢ)”) for unsampled entries, turning a missing entry into a deterministic upper bound. This enables the construction of non‑negative unbiased estimators for functions such as the maximum of two values or the Boolean OR (distinct count) over two instances. The paper provides closed‑form expressions for these estimators under independent Poisson sampling and under coordinated bottom‑k sampling, showing how the estimator interpolates between the observed value and the inferred bound.
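The known‑seeds inference can be made concrete with a small sketch (a hypothetical helper assuming the same min(1, v/τ) inclusion rule as above; not the paper's code): an entry missing from the sample despite seed u must satisfy v ≤ τ·u.

```python
def bounds_from_seeds(outcome, seeds, tau):
    # With known seeds, a sampled entry is exact, while a missing entry
    # (not sampled under probability min(1, v/tau) with seed u) must
    # satisfy v <= tau * u -- a deterministic upper bound.
    return [("exact", o) if o is not None else ("upper", tau * u)
            for o, u in zip(outcome, seeds)]

# Hypothetical outcome for v = (8.0, ?, ?): entry 0 sampled,
# entries 1 and 2 missing under seeds 0.5 and 0.25.
info = bounds_from_seeds([8.0, None, None], [0.2, 0.5, 0.25], tau=10.0)
# entry 1: v <= 10 * 0.5 = 5.0 ; entry 2: v <= 10 * 0.25 = 2.5
```

A missing entry thus stops being a total blank: it carries a certificate that its value lies below a known threshold, which the estimator can fold into f*(S).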
Conversely, when seeds are unknown, the estimator cannot distinguish among many possible underlying values consistent with a missing entry, and the authors prove impossibility results: no non‑negative unbiased estimator exists for the maximum or for absolute difference even in the simplest binary domain. This aligns with prior work that required a large sampling fraction to estimate distinct counts accurately under unknown‑seed assumptions.
The authors apply their framework to several concrete settings:
- Weight‑oblivious Poisson sampling with independent instances – they derive two Pareto‑optimal estimators for the maximum: one tailored to data where values are similar across instances, and another for highly variable data.
- Weighted sampling with known seeds – they obtain optimal estimators for the maximum and for Boolean OR over two instances, exploiting the deterministic lower bounds that known seeds provide.
- Bottom‑k and VAROPT sampling – they extend the methodology to order‑based sampling schemes, again achieving lower variance than H‑T‑based baselines.
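To make the variance gain tangible, here is a simulation under weight‑oblivious Poisson sampling of two instances with known inclusion probabilities. The `partial_max` estimator below uses the decomposition max(v₁, v₂) = v₁ + (v₂ − v₁)⁺; it is one simple unbiased estimator that exploits the single‑entry outcome, offered as an illustrative stand‑in rather than the paper's Pareto‑optimal construction.

```python
import random

def ht_max(s1, s2, p1, p2):
    # Classic H-T: positive only when both entries are sampled.
    if s1 is not None and s2 is not None:
        return max(s1, s2) / (p1 * p2)
    return 0.0

def partial_max(s1, s2, p1, p2):
    # Estimate max(v1, v2) = v1 + (v2 - v1)^+ term by term with
    # inverse-probability weights; the outcome where only v1 is
    # sampled now contributes, cutting variance when v1 and v2 are close.
    est = 0.0
    if s1 is not None:
        est += s1 / p1
        if s2 is not None:
            est += max(s2 - s1, 0.0) / (p1 * p2)
    return est

rng = random.Random(0)
v1, v2, p1, p2 = 10.0, 10.0, 0.5, 0.5
ht, pi = [], []
for _ in range(200_000):
    s1 = v1 if rng.random() < p1 else None
    s2 = v2 if rng.random() < p2 else None
    ht.append(ht_max(s1, s2, p1, p2))
    pi.append(partial_max(s1, s2, p1, p2))

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
# Both estimators average near max(v1, v2) = 10, but partial_max
# has markedly lower variance on this similar-values input.
```

Both means converge to 10, confirming unbiasedness, while the partial‑information estimator shows substantially lower empirical variance on this input where the two values coincide.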
Experimental evaluation on real‑world workloads (web request logs, network traffic traces, sensor measurements) demonstrates substantial variance reductions: mean‑squared‑error improvements of 30–70% over H‑T estimators for distinct‑count, max‑dominance, and Manhattan‑distance queries. The gains are especially pronounced when the sample size is small, highlighting the practical relevance for bandwidth‑ or energy‑constrained environments.
In summary, the paper makes three major contributions:
- A unified theoretical framework for constructing unbiased, non‑negative, Pareto‑optimal estimators that leverage partial information from multi‑instance samples.
- A clear delineation of the role of seed knowledge, including positive results (optimal estimators) when seeds are known and impossibility results when they are not.
- Practical algorithms and empirical validation across a range of common sampling schemes, showing that the proposed estimators consistently outperform the classic Horvitz‑Thompson approach.
These results broaden the applicability of sampling‑based summarization techniques, offering a path to more accurate query estimation in distributed, streaming, and resource‑constrained data processing systems.