A SWAR Approach to Counting Ones
We investigate the complexity of algorithms counting ones under different sets of operations. With addition and logical operations (but no shift), $O(\log^2(n))$ steps suffice to count ones. Parity can be computed with complexity $O(\log(n))$, matching the bound for methods that use shift operations. If multiplication is available, a solution of time complexity $O(\log^*(n))$ is possible, improving on the known bound $O(\log\log(n))$.
💡 Research Summary
The paper investigates population‑count (bit‑count) and parity computation under various restricted instruction sets, focusing on SIMD‑within‑a‑register (SWAR) techniques. The authors define three computational models: OPAL (Oblivious Parallel Addition and Logical) which allows only logical operations (AND, OR, XOR) and addition; PAL, which adds flow‑control constructs; and a model that also permits multiplication (and shift).
For the OPAL model they present two algorithms. The first, Theorem 1, partitions an n‑bit word into √n fields of √n bits each, inserts “spacer” bits, and repeatedly deletes the least‑significant 1 in each field using x & (x‑1). Because each field can be emptied in at most √n steps, the overall runtime is O(√n). The method is conceptually simple but requires large constant masks, making it impractical for typical word sizes.
The second OPAL algorithm (Theorem 2) achieves O(log² n) time without multiplication or shift. It repeatedly merges adjacent fields of size 2ⁱ⁻¹ into fields of size 2ⁱ, using Lemma 1 to move the least‑significant bit of a field to its most‑significant position by a combination of addition and masking. Each merging stage costs O(log n) operations and there are O(log n) stages, yielding O(log² n). The authors provide a concrete C‑like implementation for the lower 16 bits of a 32‑bit word, employing a cascade of masks such as 0x5555, 0x6666, etc.
Parity is treated separately. Theorem 3 shows that, within OPAL, parity can be computed in O(log n) by first forming 2‑bit fields, XOR‑ing each pair, and then propagating the resulting single parity bit across increasingly larger fields using the same left‑shift‑by‑addition trick. Because only one bit per field needs to be moved, the constant factor is smaller than for full counting. A final conditional test converts any non‑zero result to 1, giving a PAL‑compatible O(log n) parity routine (Corollary 2).
When multiplication is allowed, the authors claim a dramatic improvement: Theorem 4 presents an O(log* n) algorithm. The idea is to maintain fields of length k² at iteration k, split each field into two halves, mask them, multiply each half by a specially crafted constant mₖ, shift the products, and combine them with another mask cₖ. This effectively adds the counts of two adjacent halves while keeping the result within 2·(k‑1)² bits. The iteration stops when the field size exceeds n, at which point a final multiplication and shift extracts the total count. Because the field size grows doubly‑exponentially, the number of iterations equals the iterated logarithm log* n, improving on the classic O(log log n) Gillies‑Miller method.
The paper also adapts the O(log* n) technique to parity, replacing the final multiplication by a division based on HAKMEM item 169, resulting in a parity routine that uses only seven “C‑operations”.
A discussion section summarizes the results in a table, contrasting the upper bounds for each operation set: OPAL gives O(log² n), PAL gives O(log n) for parity and O(log² n) for counting, and the multiplication‑enhanced model yields O(log* n). The authors note a lower bound of Ω(log n / log log n) for parity in the broadword model, derived from circuit‑complexity literature, and remark that no algorithm is known with a bound proportional to the actual number of 1‑bits ν(x).
Critically, the paper lacks experimental validation; the constants hidden in the O‑notation are substantial, especially for the O(√n) and O(log² n) methods, which rely on large pre‑computed masks. Consequently, on modern CPUs with hardware POPCNT instructions, the proposed algorithms are unlikely to be faster in practice. Nevertheless, the theoretical contribution is clear: it maps the exact asymptotic limits of bit‑count and parity under progressively richer instruction sets, and introduces a novel iterated‑logarithm algorithm that, while heavy on multiplication and masking, pushes the known bound from O(log log n) down to O(log* n). This work may be of particular interest for specialized architectures where multiplication is cheap but shift or rotate instructions are unavailable, or for theoretical studies of arithmetic circuit complexity.