Approximating quantiles in very large datasets
Very large datasets are often encountered in climatology, arising either from a multiplicity of observations over time and space or from the outputs of deterministic models (sometimes reaching petabytes; 1 petabyte = 1 million gigabytes). Loading such a large data vector and sorting it is sometimes impossible due to memory limitations or insufficient computing power. We show that a previously proposed algorithm for approximating the median, the “median of medians,” performs poorly. Instead, we develop an algorithm that approximates quantiles of very large datasets by partitioning the data or by using existing partitions (possibly of unequal size). We establish the deterministic precision of this algorithm and show how it can be adjusted to achieve a customized precision.
💡 Research Summary
The paper addresses the practical problem of estimating quantiles in data sets that are too large to fit into memory or to be sorted directly—a situation common in climatology, remote sensing, and large‑scale numerical simulations where data volumes can reach petabytes. While the classic “median of medians” (MoM) selection algorithm guarantees linear‑time complexity in theory, the authors demonstrate that its performance degrades sharply on real‑world, non‑uniformly partitioned data. The recursive grouping strategy of MoM leads to unpredictable partition sizes, excessive recursion depth, and cache inefficiencies, resulting in error bounds that far exceed user‑specified tolerances, especially when the underlying distribution is skewed or multimodal.
To overcome these limitations, the authors propose a Partition‑Based Quantile Approximation (PBQA) framework. The key idea is to exploit existing or deliberately created data partitions—these may correspond to temporal windows, spatial tiles, or any logical grouping that does not need to be of equal size. Within each partition, a lightweight preprocessing step is performed: either a full local sort (if the partition fits in memory) or a histogram that captures the frequency distribution. After preprocessing, the algorithm computes cumulative counts across partitions to locate the partition that contains the target quantile q (0 < q < 1). Once the appropriate partition k is identified, the algorithm refines the estimate by either a binary search on the locally sorted data or by interpolating within the histogram bins.
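The histogram path described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `approx_quantile`, the shared-bin-edges design, and the two-pass structure are assumptions made for the example; each partition is assumed to be a NumPy array small enough to process on its own.

```python
import numpy as np

def approx_quantile(partitions, q, n_bins=1000):
    """Approximate the q-th quantile (0 < q < 1) of data split into
    partitions of possibly unequal size, without concatenating them."""
    # Pass 1: find the global value range from per-partition extremes.
    lo = min(p.min() for p in partitions)
    hi = max(p.max() for p in partitions)
    edges = np.linspace(lo, hi, n_bins + 1)

    # Pass 2: accumulate per-partition histograms over shared bin edges.
    counts = np.zeros(n_bins, dtype=np.int64)
    total = 0
    for p in partitions:
        c, _ = np.histogram(p, bins=edges)
        counts += c
        total += p.size

    # Locate the first bin whose cumulative count reaches the target rank.
    target = q * total
    cum = np.cumsum(counts)
    k = int(np.searchsorted(cum, target))

    # Refine by linear interpolation within bin k.
    prev = cum[k - 1] if k > 0 else 0
    frac = (target - prev) / max(counts[k], 1)
    return edges[k] + frac * (edges[k + 1] - edges[k])
```

Because each partition is scanned independently, the per-partition histogram step parallelizes trivially; only the small `counts` vectors need to be combined.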
The deterministic error analysis shows that the only source of approximation error is the “partition error” δ, defined as the maximum fraction of the total data that any single partition contributes (δ = max_i n_i / N). By controlling partition size, users can bound δ to any desired ε, thereby guaranteeing that the final quantile estimate deviates from the true value by at most ε (plus any negligible histogram interpolation error). The computational complexity of PBQA is O(∑ n_i log n_i) when local sorts are used, or O(N) when histograms suffice, while memory consumption is limited to the size of the largest partition, O(max_i n_i). This contrasts sharply with MoM, which repeatedly scans the entire data set and can require multiple copies of the data in memory.
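As a toy illustration of the partition error δ (the sizes below are hypothetical, e.g. unequal monthly files in a climate archive):

```python
def partition_error(sizes):
    """Delta = max_i n_i / N: the largest fraction of the total data
    contributed by any single partition, which bounds the rank error
    of an estimate that resolves the quantile only to one partition."""
    total = sum(sizes)
    return max(sizes) / total

# Hypothetical unequal partition sizes.
sizes = [4_000_000, 2_500_000, 3_100_000, 900_000]
delta = partition_error(sizes)  # 4_000_000 / 10_500_000, about 0.38
```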
A notable contribution is the “custom precision” mechanism. Users specify an acceptable error tolerance ε; the algorithm then automatically determines a partitioning scheme that ensures each partition’s size does not exceed ε·N. In practice, this results in partitions containing at most a few million elements even when N is on the order of billions, keeping both memory usage and runtime modest. The authors validate the approach on synthetic benchmarks and on real climate model outputs (CMIP6) and satellite observation archives. In these experiments, PBQA achieved speed‑ups of 20‑30× over full sorting while maintaining quantile errors well below the prescribed ε (often <0.2 %). By contrast, MoM exhibited errors exceeding 5 % under the same conditions.
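The sizing rule behind the custom-precision mechanism reduces to simple arithmetic: capping every partition at ε·N elements requires at least ⌈1/ε⌉ partitions. A sketch (the helper `plan_partitions` is hypothetical, not the paper's API):

```python
import math

def plan_partitions(n_total, eps):
    """Return (number of partitions, maximum partition size) such that
    no partition exceeds eps * n_total elements, guaranteeing that the
    partition error delta is at most eps."""
    max_size = max(1, math.floor(eps * n_total))
    n_parts = math.ceil(n_total / max_size)
    return n_parts, max_size

# E.g. two billion elements with eps = 0.001: at most 2 million
# elements per partition, spread over at least 1000 partitions.
plan_partitions(2_000_000_000, 0.001)
```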
The paper concludes that partition‑based quantile approximation provides a scalable, deterministic, and tunable alternative to classic selection algorithms for ultra‑large data sets. It works naturally with parallel and distributed environments because each partition can be processed independently, and it integrates seamlessly with existing data storage layouts. Future work is suggested in three directions: (1) dynamic repartitioning for streaming data where the data distribution evolves over time, (2) extension to multivariate quantiles (e.g., joint percentiles across several variables), and (3) adaptive histogram binning strategies that further reduce interpolation error without sacrificing speed. Overall, the study offers a practical roadmap for researchers and engineers who need reliable quantile estimates without the prohibitive cost of full data sorting.