Faster Positional-Population Counts for AVX2, AVX-512, and ASIMD

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The positional population count operation pospopcnt() counts, for an array of w-bit words, how often each of the w bits was set. Applications are found in bioinformatics, database engineering, and digital processing. Building on earlier work by Klarqvist et al., we show how positional population counts can be computed rapidly using SIMD techniques, with good performance from the first byte, approaching memory-bound speeds for input arrays of as little as 4 KiB. Improvements include a refined algorithm structure, better handling of unaligned and very short arrays, and faster bit-parallel accumulation of intermediate results. We provide a generic algorithm description as well as implementations for various SIMD instruction-set extensions, including Intel AVX2, AVX-512, and ARM ASIMD, and discuss the adaptation of our algorithm to other platforms.


💡 Research Summary

The paper presents a high‑performance SIMD‑based algorithm for computing the positional population count (pospopcnt), which for an array of w‑bit words returns w separate counts of how many times each bit position is set. While the conventional population count aggregates all bits, pospopcnt generalises this to a per‑position histogram and is useful in one‑hot encoded database group‑by queries, wavelet‑tree construction, DNA pattern matching, and other statistical tasks.
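
The operation itself can be pinned down with a short reference implementation. This is a plain Python sketch of the semantics, not the paper's SIMD code:

```python
def pospopcnt(words, w=16):
    """Reference positional population count: returns a list where
    counts[j] is the number of words whose bit j is set."""
    counts = [0] * w
    for word in words:
        for j in range(w):
            counts[j] += (word >> j) & 1
    return counts
```

For example, `pospopcnt([0b1011, 0b0001], w=4)` yields `[2, 1, 0, 1]`: bit 0 is set in both words, bits 1 and 3 in one word each.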

Building on the earlier SIMD approach by Klarqvist et al., the authors redesign the algorithm to achieve memory‑bandwidth‑limited performance even for very small inputs (as little as 4 KiB). The key innovations are threefold:

  1. Simplified CSA for the first iteration – The Harley‑Seal carry‑save adder (CSA) scheme is replaced in the initial stage by a streamlined three‑input full‑adder that can be realised with a ternary‑logic instruction (vpternlogd on AVX‑512, bsl on ASIMD). This reduces the critical path to a single cycle on AVX‑512 and to two cycles on AVX2/ASIMD, cutting latency dramatically.

  2. Robust handling of unaligned and tiny arrays – The algorithm uses mask registers (AVX‑512) or conditional loads (ASIMD) to load partial vectors safely when the input does not fill a whole vector, avoiding out‑of‑bounds reads and misaligned‑access penalties. For inputs shorter than a full vector, a hybrid scalar‑SIMD path is selected, eliminating the overhead of vector setup for such cases.

  3. Bit‑parallel accumulation via transposition – After each CSA compression step the intermediate “sum” vectors (Σ) are transposed so that bits belonging to the same position become contiguous in a register. This enables direct vector addition into the per‑position accumulators without extra horizontal reductions. On AVX2 the transposition is performed with vpshufb inside each 128‑bit lane plus vpermd for cross‑lane shuffling; on AVX‑512 a single vpermt2d shuffle suffices; on ASIMD the flexible tbl instruction achieves the same effect.
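
The full-adder step behind the first innovation can be sketched with plain bitwise operations. On AVX‑512 the sum line collapses into a single vpternlogd; here, without ternary-logic instructions, it takes two XORs:

```python
def csa(a, b, c):
    """Carry-save (3:2) compression of three bit-parallel words: at
    every bit position j, the count (a>>j&1)+(b>>j&1)+(c>>j&1) equals
    (sum bit) + 2 * (carry bit). The carry word therefore carries
    weight 2 in later accumulation."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    return s, carry
```

Two words survive where three went in, which is what lets a CSA network fold many input words into a few weighted registers before any counting happens.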
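
The masked loads of the second innovation can likewise be mimicked in a few lines (the name and shape here are illustrative, not the paper's API): lanes beyond the end of the buffer read as zero instead of faulting.

```python
def masked_load(buf, start, lanes):
    """Emulates a masked/partial vector load: in-bounds lanes come
    from the buffer, out-of-bounds lanes are zeroed, so the same
    kernel can also consume an unaligned head or tail safely."""
    return [buf[start + k] if start + k < len(buf) else 0
            for k in range(lanes)]
```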

The overall algorithm consists of three phases: (a) preprocessing (alignment, masking), (b) the main loop (CSA compression, transposition, accumulation), and (c) post‑processing (final reduction of the weighted accumulators). The design deliberately limits register pressure, allowing the compiler and the out‑of‑order core to keep the pipelines fully utilised.
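
A scalar sketch can string the three phases together under these assumptions: groups of three words go through one carry-save step in the main loop, the weighted sum/carry bits are folded into the counters, and leftover words take a scalar tail (alignment and masking are moot without real vectors):

```python
def pospopcnt_csa(words, w=16):
    """Scalar sketch of the three-phase structure: CSA compression in
    the main loop, weighted accumulation of the sum/carry words, and
    a scalar tail for the remainder. Matches the naive per-bit count."""
    counts = [0] * w
    i = 0
    while i + 3 <= len(words):
        a, b, c = words[i:i + 3]
        s = a ^ b ^ c                        # weight-1 bits
        carry = (a & b) | (c & (a ^ b))      # weight-2 bits
        for j in range(w):
            counts[j] += ((s >> j) & 1) + 2 * ((carry >> j) & 1)
        i += 3
    for word in words[i:]:                   # tail: fewer than 3 words left
        for j in range(w):
            counts[j] += (word >> j) & 1
    return counts
```

In the real SIMD kernels the per-bit inner loop disappears: the transposition step rearranges the sum/carry vectors so that one vector addition updates all w accumulators at once.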

Performance evaluation was carried out on a range of CPUs: Intel Ice Lake (AVX‑512), Skylake (AVX2), AMD Zen 2 (AVX2), and ARM Neoverse N1 (ASIMD). Benchmarks cover various input sizes (256 B to 1 MiB) and word widths w = 4, 8, 16. Results show that the AVX‑512 implementation reaches ~30 GB/s throughput on a 4 KiB input, essentially the memory‑bandwidth limit of the platform, and outperforms the prior SIMD baseline by up to 2.1×. AVX2 gains 1.3‑1.5× over the baseline and already beats scalar code for inputs as small as 8 KiB. The ASIMD version matches AVX2 on high‑end cores and exceeds it on the Neoverse N1, while on low‑power Cortex‑A76 cores the benefit diminishes for very small inputs due to permutation latency.

The authors argue that the algorithm’s core ideas (CSA networks, ternary‑logic full adders, masked loads, and bit‑parallel transposition) are portable to other SIMD extensions such as ARM SVE2 or the RISC‑V vector extension. Consequently, any workload that requires per‑bit histograms can be accelerated with minimal code changes across a wide hardware spectrum.

In conclusion, the paper delivers a practical, open‑source SIMD library that computes positional population counts with near‑memory‑bandwidth performance for inputs as small as a few kilobytes. The three technical contributions (first‑iteration CSA simplification, robust unaligned/short‑array handling, and transposition‑based accumulation) together close the performance gap that previously limited SIMD‑based pospopcnt to large data sets. This work enables faster group‑by queries, more responsive DNA‑sequence analyses, and efficient bit‑level statistics in a variety of high‑throughput applications.

