Parallel calculation of the median and order statistics on GPUs with application to robust regression

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

We present and compare various approaches to a classical selection problem on Graphics Processing Units (GPUs). The selection problem consists of selecting the $k$-th smallest element, called the $k$-th order statistic, from an array of size $n$. We focus on calculating the median of a sample, the $n/2$-th order statistic. We introduce a new method based on minimization of a convex function, and show its numerical superiority when calculating the order statistics of very large arrays on GPUs. We outline an application of this approach to efficient estimation of model parameters in high-breakdown robust regression.


💡 Research Summary

The paper addresses the classic selection problem—finding the k‑th smallest element in an array—on modern Graphics Processing Units (GPUs), with a particular focus on the median (the n/2‑th order statistic). After reviewing the importance of order statistics in robust statistics, signal processing, and large‑scale data analytics, the authors first implement and benchmark several well‑known GPU‑based selection strategies: Quickselect, heap‑based selection, radix sort, and bitonic sort. Their CUDA profiling reveals that Quickselect suffers from severe thread divergence and irregular memory accesses, heap‑based methods generate excessive global‑memory traffic due to pointer chasing, radix sort requires costly integer‑to‑float conversions for real‑valued data, and bitonic sort’s O(n log n) complexity together with high shared‑memory consumption limits scalability for very large n.
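The divergence problem identified in the profiling above is inherent to Quickselect's data-dependent control flow, which is visible even in a minimal serial sketch. The following Python is purely illustrative (not the benchmarked GPU code): the recursion depth and partition sizes depend on the data, which is exactly the branching that splits GPU warps.

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element (0-indexed) of xs.

    The branch taken below depends on how the random pivot splits
    the data, so threads handling different chunks follow different
    paths -- the source of warp divergence on GPUs.
    """
    pivot = random.choice(xs)
    lows = [x for x in xs if x < pivot]
    pivots = [x for x in xs if x == pivot]
    highs = [x for x in xs if x > pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect(highs, k - len(lows) - len(pivots))
```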

The core contribution is a novel algorithm that reframes median computation as the minimization of the convex, piecewise-linear function $f(t)=\sum_{i=1}^{n}|x_i-t|$.
Because $f(t)$ is convex and its subgradient vanishes precisely at the median, the problem becomes a one-dimensional convex optimization task. The authors propose two parallel solvers: a parallel bisection method and a parallel Newton–Raphson variant. In each iteration, every thread processes a chunk of the input, computes the sign of $x_i - t$ and the absolute deviation, and participates in a warp-level reduction (using CUDA's shuffle intrinsics) to obtain the total subgradient and the function value. The bisection step halves the search interval based on the sign of the aggregated subgradient; the Newton–Raphson step updates $t$ using an approximate second derivative (essentially a step-size heuristic) to accelerate convergence. Both procedures require only O(n) arithmetic per iteration and O(log ε⁻¹) iterations to reach a tolerance ε, while keeping memory accesses coalesced and avoiding thread divergence.
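The bisection solver can be sketched serially. The Python below is an illustrative CPU analogue (not the authors' CUDA implementation): each subgradient evaluation is a single O(n) pass over the data, which on a GPU becomes one coalesced read of the array followed by a parallel reduction.

```python
def median_bisection(xs, tol=1e-9):
    """Approximate the minimizer of f(t) = sum_i |x_i - t| (a median)
    by bisection on the subgradient g(t) = #{x_i < t} - #{x_i > t}.

    g(t) < 0 means more points lie above t, so the median is to the
    right of t; g(t) >= 0 means it is at or to the left of t.
    """
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # One O(n) pass: a parallel reduction on the GPU.
        g = sum(1 for x in xs if x < mid) - sum(1 for x in xs if x > mid)
        if g < 0:
            lo = mid          # median lies above mid
        else:
            hi = mid          # median lies at or below mid
    return 0.5 * (lo + hi)
```

Note that the number of loop iterations depends only on the initial interval width and the tolerance, not on the data distribution, which is why every thread follows the same control path.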

Implementation details are carefully optimized: the input array is tiled into shared memory per block, reducing global-memory bandwidth; intra-block reductions are performed without explicit synchronization using warp shuffles; inter-block reductions are handled by a two-kernel scheme that avoids atomic operations. The initial search interval is chosen to bracket the median, e.g. the minimum and maximum of the sample.
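The two-kernel inter-block reduction mentioned in this paragraph can be emulated serially. In this hedged Python sketch (the function name and `block_size` are illustrative, not from the paper), the first "kernel" writes one partial subgradient per block into disjoint slots, and a second "kernel" reduces the partials, so no global atomics are needed.

```python
def subgradient_two_stage(xs, t, block_size=4):
    """Serial emulation of a two-kernel reduction for the subgradient
    g(t) = #{x_i < t} - #{x_i > t} of f(t) = sum_i |x_i - t|.
    """
    # "Kernel" 1: each block writes its partial sum to its own slot.
    partials = []
    for b in range(0, len(xs), block_size):
        block = xs[b:b + block_size]
        partials.append(sum((x < t) - (x > t) for x in block))
    # "Kernel" 2: a second launch reduces the per-block partials.
    return sum(partials)
```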

