Roaring Bitmaps: Implementation of an Optimized Software Library
Modern commodity processors use parallelism to accelerate processing. SIMD instructions offer a particular form of processor parallelism that proves advantageous for processing large volumes of data. Whereas regular instructions operate on a single machine word (e.g., 64 bits), SIMD instructions operate on large registers (e.g., 256 bits) that can be used to represent “vectors” containing several distinct values. For example, a single SIMD instruction can add sixteen 16-bit integers in one 32 B vector register to the corresponding 16-bit integers in another vector register in a single cycle. Moreover, modern processors are capable of superscalar execution of certain vector instructions, allowing the execution of several SIMD instructions per cycle. For example, an Intel Skylake processor is capable of executing two vector additions every cycle.
SIMD instructions are ideally suited for operations between bitset containers. When computing the intersection, union, difference or symmetric difference between two (uncompressed) bitsets, we only have to load the data in registers, apply a logical operation between two words (AND, OR, AND NOT, or XOR) and, optionally, save the result to memory. All of these operations have corresponding SIMD instructions. So, instead of working over 64-bit words, we work over larger words (e.g., 256 bits), dividing the number of instructions (e.g., $`\div 4`$) and giving significantly faster processing.
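As a minimal illustration (a sketch, not the library's code; the function name `bitset_and` is ours), the word-at-a-time loop below computes the intersection of two bitsets. With AVX2, the loop body would operate on `__m256i` values via `_mm256_and_si256`; optimizing compilers frequently apply this transformation automatically.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: bitwise AND of two bitsets, one machine word at a time.
 * With AVX2 intrinsics, the loop body would use _mm256_and_si256 on
 * 256-bit registers; compilers commonly auto-vectorize this form. */
void bitset_and(const uint64_t *a, const uint64_t *b,
                uint64_t *out, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        out[i] = a[i] & b[i];
}
```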
Historically, SIMD instructions have gotten wider and increasingly powerful. The Pentium 4 was limited to 128-bit instructions. Later x64 processors introduced increasingly powerful 128-bit instruction sets: SSSE3, SSE4, SSE4.1, SSE4.2. Today, we find rich support for 256-bit vector registers in the AVX2 instruction set available in recent x64 processors from Intel (starting with the Haswell microarchitecture, 2013) and AMD (starting with the Excavator microarchitecture, 2015). Upcoming Intel commodity processors will support 512-bit SIMD instructions (AVX-512). In many cases, the benefits of SIMD instructions increase as they get wider, which means that SIMD instructions tend to become more compelling with time.
Compilers can automatically translate C or C++ code into the appropriate SIMD instructions, even when the code does not take into account vectorization (scalar code). However, for greater gains, it is also possible to design algorithms and code that take into account the vector instructions. In these instances, we can use the SIMD intrinsics available in C and C++ to call SIMD instructions without having to use assembly code. See Table 1. These intrinsics are supported across several compilers (LLVM’s Clang, GNU GCC, Intel’s compiler, Visual Studio).
| instruction | C intrinsic | description | rec. throughput |
|---|---|---|---|
| vpand | _mm256_and_si256 | 256-bit AND | 0.33 |
| vpor | _mm256_or_si256 | 256-bit OR | 0.33 |
| vpxor | _mm256_xor_si256 | 256-bit XOR | 0.33 |
| vpsadbw | _mm256_sad_epu8 | sum of the absolute differences of the byte values to the low 16 bits of each 64-bit word | 1 |
| vpaddq | _mm256_add_epi64 | add 64-bit integers | 0.5 |
| vpshufb | _mm256_shuffle_epi8 | shuffle bytes within 128-bit lanes | 1 |
Some relevant AVX and AVX2 instructions with reciprocal throughput$`^{\mathrm{a}}`$ in CPU cycles on recent (Skylake) Intel processors.
The reciprocal throughput is the number of processor clocks it takes for an instruction to execute. A reciprocal throughput of 0.5 means that we execute two such instructions per clock.
Vectorized Population Count Over Bitsets
We can trivially vectorize operations between bitsets. Indeed, it suffices to compute bitwise operations over vectors instead of machine words. By aggressively unrolling the resulting loop, we can produce highly efficient code. Optimizing compilers can often automatically vectorize such code. It is more difficult, however, to also compute the cardinality of the result efficiently. Ideally, we would like to vectorize simultaneously the operations between bitsets and the computation of the cardinality of the result.
Vectorized Harley-Seal Population Count
Commodity processors have dedicated instructions to count the 1-bits in
a word (the “population count”): popcnt for x64 processors and cnt
for the 64-bit ARM architecture. On recent Intel processors, popcnt
has a throughput of one instruction per cycle for both 32-bit and 64-bit
registers.
Perhaps surprisingly, Muła et al. found that it is possible to do better. We can use a vectorized approach based on the circuit-based Harley-Seal algorithm. It is inspired by a carry-save adder (CSA) circuit: given 3 bit values ($`a,b,c\in \{0,1\}`$), we can sum them up into the two-bit value $`(a \text{ XOR } b) \text{ XOR } c`$ (least significant) and $`(a \text{ AND } b) \allowbreak \text{ OR } \allowbreak ( (a \text{ XOR } b) \text{ AND } c )`$ (most significant). That is, we can sum 3 individual bit values into a 2-bit counter using 5 logical operations, and we can generalize this approach to 256-bit vectors. Starting with 3 input vectors, we can generate two new output vectors, one containing the least significant bits and one containing the most significant (or carry) bits, with 5 bitwise logical operations (two XORs, two ANDs and one OR):
void CSA(__m256i *h, __m256i *l, __m256i a,
__m256i b, __m256i c) {
  // starts with inputs a, b, c; sets *h (carry bits) and *l (sum bits); u is a temporary
__m256i u = _mm256_xor_si256(a, b);
*h = _mm256_or_si256(_mm256_and_si256(a, b),
_mm256_and_si256(u, c));
*l = _mm256_xor_si256(u, c);
}
From such an adder function, we can derive an efficient population-count function, effectively composing 3-input adders into a circuit with 16 inputs, which encodes the population count of its 16 inputs as a 5-bit binary number. For instance, Knuth presents a construction requiring only 63 logical operations in the case of a 16-input circuit. Although using slightly more logical operations (75), the iterated Harley-Seal approach goes further: it increments the existing contents of a 5-bit binary number by the population count.
In a SIMD setting, our Harley-Seal circuit takes sixteen 256-bit vectors
and updates five 256-bit vectors serving as a bit-sliced accumulator:
in other words, if one counts the number of 1s amongst the
least-significant bits of the 16 inputs, the population count (0 to 16)
affects the least-significant bits of the 5 accumulator vectors.
Similarly, the population count of the second-least-significant bits of
the 16 inputs affects the second-least-significant bits of the
accumulator vectors. Equivalently, the 5 accumulator vectors can be
viewed as providing 256 different 5-bit accumulators. The accumulator
vectors are named ones, twos, fours, eights, and sixteens,
reflecting their use in the accumulator: if the least significant bits
of the accumulator vectors are all initially zero and the population
count of the least-significant bits of the 16 input vectors is 9, then
our Harley-Seal approach will result in the least-significant bits of
eights and ones being set, whereas the least-significant bits of
sixteens, fours and twos would be clear. (The accumulator went
from storing 0 to storing 9.) If the accumulator updated a second time
with the same inputs, then the least-significant bits of sixteens and
twos (only) would be set.
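To make the scheme concrete, here is a scalar sketch of the iterated Harley-Seal counter following Muła et al.'s published construction, with 64-bit words standing in for 256-bit vectors (the function names `csa`, `pop64` and `harley_seal` are ours). The AVX2 version replaces each word with a `__m256i` and each operator with the corresponding intrinsic.

```c
#include <stdint.h>
#include <stddef.h>

/* Carry-save adder on 64 bit-slices: returns the sum bits, stores the carry
 * bits in *h. */
static uint64_t csa(uint64_t *h, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *h = (a & b) | (u & c);
    return u ^ c;
}

/* Portable word popcount (Kernighan's loop); real code would use the popcnt
 * instruction or the vectorized lookup described in the text. */
static unsigned pop64(uint64_t x) {
    unsigned c = 0;
    for (; x; x &= x - 1) c++;
    return c;
}

/* Harley-Seal population count over an array whose size is a multiple of 16
 * words. The sixteens carry is flushed into `total` at the end of each block. */
uint64_t harley_seal(const uint64_t *d, size_t size) {
    uint64_t total = 0, ones = 0, twos = 0, fours = 0, eights = 0, sixteens = 0;
    uint64_t twosA, twosB, foursA, foursB, eightsA, eightsB;
    for (size_t i = 0; i + 16 <= size; i += 16) {
        ones = csa(&twosA, ones, d[i + 0], d[i + 1]);
        ones = csa(&twosB, ones, d[i + 2], d[i + 3]);
        twos = csa(&foursA, twos, twosA, twosB);
        ones = csa(&twosA, ones, d[i + 4], d[i + 5]);
        ones = csa(&twosB, ones, d[i + 6], d[i + 7]);
        twos = csa(&foursB, twos, twosA, twosB);
        fours = csa(&eightsA, fours, foursA, foursB);
        ones = csa(&twosA, ones, d[i + 8], d[i + 9]);
        ones = csa(&twosB, ones, d[i + 10], d[i + 11]);
        twos = csa(&foursA, twos, twosA, twosB);
        ones = csa(&twosA, ones, d[i + 12], d[i + 13]);
        ones = csa(&twosB, ones, d[i + 14], d[i + 15]);
        twos = csa(&foursB, twos, twosA, twosB);
        fours = csa(&eightsB, fours, foursA, foursB);
        eights = csa(&sixteens, eights, eightsA, eightsB);
        total += pop64(sixteens);
    }
    return 16 * total + 8 * pop64(eights) + 4 * pop64(fours)
         + 2 * pop64(twos) + pop64(ones);
}
```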
Fig. 3 illustrates a simplified case with only ones, twos and fours.
Fig. 4 shows the high-level
process: we load 16 vectors from the source bitset, we compute the
bitwise operations, and obtain the result in ones, twos, fours,
eights and sixteens. This processes a 4 kb block of an input bitset
using approximately 75 AVX operations, not considering any data-movement
instructions (see Fig. 2). We can compute the total value of
all 256 accumulators as
$`16\times\mathrm{count}(\texttt{sixteens})+8\times\mathrm{count}(\texttt{eights})+4\times\mathrm{count}(\texttt{fours})+2\times \mathrm{count}(\texttt{twos})+\mathrm{count}(\texttt{ones})`$
(13 more operations), where $`\mathrm{count}(b)`$ denotes the population
count of $`b`$. Iterated 16 times, we can process the entire input
bitset. Moreover, it is only necessary to read the total value of all
256 accumulators after the last iteration. However, we need to ensure
that, at the beginning of each iteration, none of our 5-bit accumulators
ever begins with a count above 15 — otherwise, an accumulator might need
to store a value exceeding 31. Therefore, at the end of each block we
count the 1-bits in each of the four 64-bit words of sixteens and add
the result to a vector counter $`c`$ before zeroing sixteens. To count
the 1-bits in sixteens quickly, we use vector registers as lookup
tables mapping 4-bit values (integers in $`[0,16)`$) to their
corresponding population counts, and the vpshufb instruction can effectively look
up 32 byte values at once. Thus we can divide each byte of sixteens
into its low nibble and high nibble and look up the population count of
each. To get a population count for the 64-bit words, we use the
instruction vpsadbw (with the intrinsic _mm256_sad_epu8): this
instruction adds the absolute values of the differences between byte
values within 64-bit subwords. We illustrate the code in
Fig. 1: the vectors lookuppos and
lookupneg store the positives and negatives of the population counts
with a canceling offset of 4 so that their subtraction is the sum of the
population counts. After the $`16^{\textrm{th}}`$ iteration, the
population count of the entire input bitset is computed as
$`16\times c+8\times\mathrm{count}(\texttt{eights})+4\times\mathrm{count}(\texttt{fours})+2\times \mathrm{count}(\texttt{twos})+\mathrm{count}(\texttt{ones})`$.
By contrast, in the approach using the dedicated population-count
instruction, the compiler generates one load, one popcnt and one add
per input 64-bit word. Current x64 processors decompose complex machine
instructions into low-level instructions called μops. The popcnt approach
generates three μops per word. For the SIMD approach, we process sixteen
256-bit vectors using 98 μops, including 16 loads, 32 bitwise ANDs (vpand),
15 bitwise ORs (vpor) and 30 bitwise XORs (vpxor)—or about 1.5 μops to
process each 64-bit word. That is, the vectorized approach generates
half the number of μops. We analyzed our implementations with the IACA code
analyser for our target x64 microarchitecture
(§ 5.2). Unsurprisingly, IACA predicts a
throughput of one word per cycle with the dedicated population-count
function, and a throughput of slightly more than two words per cycle
with the vectorized approach. Microbenchmarking agrees with IACA and
shows that our approach is roughly twice as fast as relying exclusively
on dedicated population-count instructions.
[Figure: carry-save aggregation of the inputs (A[i+0], …, A[i+15]) into ones and twos, producing ones, twos and fours.]

Vectorized Operations Over Bitsets With Population Counts
To aggregate two bitsets (e.g., with the intersection/AND, union/OR, etc.) while maintaining the cardinality, we found that an efficient approach is to load two machine words from each bitset, compute the bitwise operation, apply the population-count instruction on the two resulting words, and write them out before processing the next words. To benefit from vectorization, we could compute the bitwise logical operations using SIMD instructions, and then reload the result from memory and apply the population-count instructions. The interplay between SIMD instructions and a population-count function operating over 64-bit words would not be ideal.
Though an obvious application of the vectorized population count is the efficient computation of the cardinality of a bitset container, we also put it to good use whenever we need to compute the intersection, union, difference or symmetric difference between two bitset containers while, at the same time, gathering the cardinality of the result. We can also use it to compute the cardinality of the intersection, union, difference or symmetric difference without materializing them. As reported by Muła et al., in these cases, the benefits of the vectorized approach can exceed a factor of two. Indeed, if we use a vectorized approach to the population count, we can count the bits directly on the vector registers, without additional data transfer.
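A scalar sketch of this fused pattern follows (the names `popcount_word` and `bitset_or_card` are ours): the vectorized version keeps the 256-bit OR results in registers and feeds them directly into the Harley-Seal counters instead of calling a word-level popcount.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the popcnt instruction. */
static unsigned popcount_word(uint64_t x) {
    unsigned c = 0;
    for (; x; x &= x - 1) c++;
    return c;
}

/* Sketch: compute the union of two bitsets while gathering the cardinality
 * of the result in the same pass, with no extra load of the result. */
uint64_t bitset_or_card(const uint64_t *a, const uint64_t *b,
                        uint64_t *out, size_t nwords) {
    uint64_t card = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t w = a[i] | b[i];
        out[i] = w;
        card += popcount_word(w);
    }
    return card;
}
```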
Vectorized Intersections Between Arrays
Because array containers represent integer values as sorted arrays of
16-bit integers, we can put to good use an algorithm based on a
vectorized string comparison function present in recent x64 processors
(SSE 4.2’s pcmpistrm). While intended for string comparisons, we can
repurpose the instruction to compute intersections. Each input array is
divided into blocks of eight 16-bit integers. There are two block
pointers (one for each array) initialized to the first blocks. We use
the pcmpistrm instruction to conduct an all-against-all comparison of
16-bit integers between these two blocks and we extract matching
integers from a mask produced by the instruction. The number of 1s in
the resulting mask indicates the number of matching integers. To avoid
the cost of comparing pairs of blocks that cannot match, a merge-like
process considers the maximum element within each block to find those
block-pairs to which pcmpistrm should be applied. That is, whenever
the current maximum value in the block of one array is smaller than or
equal to the maximum of the other block, we load the next block. In
particular, when the maxima of the two blocks are equal, we load a new
block in each array. See
Algorithm [algo:geninteralgo].
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Repeat:
   - Write the intersection between the two blocks $`B_1\cap B_2`$ to the output.
   - If the maxima of the two blocks are equal: load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted; load a new block $`B_2`$ of $`K`$ elements from the second array, or terminate if exhausted.
   - Else, if the maximum of $`B_1`$ is smaller: load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted.
   - Else: load a new block $`B_2`$ of $`K`$ elements from the second array, or terminate if exhausted.
We illustrate a simplified function in
Fig. 5. The _mm_cmpistrm intrinsic
compares all 16-bit integers from one 128-bit vector with all 16-bit
integers from another vector, returning a mask. From this mask, we
compute the number of matching values (using the _mm_popcnt_u32
intrinsic to call the popcnt instruction). Moreover, we can shuffle
the values in one of the vectors (vA) using a shuffling intrinsic
(_mm_shuffle_epi8) and a table of precomputed shuffling masks
(mask16). We could design an equivalent AVX2 function using 256-bit
vectors, but AVX2 lacks the equivalent of the _mm_cmpistrm intrinsic.
int32_t intersect(uint16_t *A, size_t lengthA,
uint16_t *B, size_t lengthB,
uint16_t *out) {
size_t count = 0; // size of intersection
size_t i = 0, j = 0;
int vectorlength = sizeof(__m128i) / sizeof(uint16_t);
__m128i vA, vB, rv, sm16;
  while ((i < lengthA) && (j < lengthB)) {
vA = _mm_loadu_si128((__m128i *)&A[i]);
vB = _mm_loadu_si128((__m128i *)&B[j]);
rv = _mm_cmpistrm(vB, vA,
_SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    int r = _mm_extract_epi32(rv, 0);
    // mask16 is a precomputed table of shuffle masks, indexed by the bit mask r
    sm16 = _mm_load_si128(mask16 + r);
__m128i p = _mm_shuffle_epi8(vA, sm16);
_mm_storeu_si128((__m128i *)(out + count),p);
count += _mm_popcnt_u32(r);
uint16_t a_max = A[i + vectorlength - 1];
uint16_t b_max = B[j + vectorlength - 1];
if (a_max <= b_max) i += vectorlength;
if (b_max <= a_max) j += vectorlength;
}
return count;
}
Special-case processing is required when our arrays do not have lengths
divisible by eight: we need to finish the algorithm with a small scalar
intersection algorithm. Special processing is also required if an array
starts with zero, because the pcmpistrm instruction is designed for
string values and it processes null values as special (string ending)
values.
Evidently, we can easily modify this function if we only wish to get the cardinality. For the most part, we merely skip the memory store instructions.
When the pcmpistrm instruction is available, as it is on all recent
x64 processors, it is advantageous to replace a conventional list
intersection function by such a vectorized approach. However, we still
rely on a galloping intersection when one of the two input arrays is
much smaller. For example, the vectorized approach is obviously not
helpful to intersect an array made of 1000 values with one made of
two values. There are vectorized versions of the galloping intersection
which could further improve our performance in these cases, but we leave
such an optimization for future work.
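For reference, the galloping (exponential-search) intersection we fall back on can be sketched as follows; this is a generic illustration under our own naming (`gallop_intersect`), not the library's exact routine. For each value of the small array, we gallop through the large array with doubling steps, then binary-search the bracketed range.

```c
#include <stdint.h>
#include <stddef.h>

/* Galloping intersection sketch: assumes both arrays are sorted and
 * duplicate-free. Runs in O(n_small * log(n_large)) worst case, and much
 * faster when matches are clustered near the front. */
size_t gallop_intersect(const uint16_t *small, size_t n_small,
                        const uint16_t *large, size_t n_large,
                        uint16_t *out) {
    size_t count = 0, lo = 0;
    for (size_t i = 0; i < n_small; i++) {
        uint16_t key = small[i];
        size_t step = 1, hi = lo;
        while (hi < n_large && large[hi] < key) { // gallop forward
            lo = hi;
            hi += step;
            step *= 2;
        }
        if (hi > n_large) hi = n_large;
        while (lo < hi) { // binary search in [lo, hi)
            size_t mid = lo + (hi - lo) / 2;
            if (large[mid] < key) lo = mid + 1; else hi = mid;
        }
        if (lo < n_large && large[lo] == key) out[count++] = key;
    }
    return count;
}
```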
Vectorized Unions Between Arrays
To compute the union between two array containers, we adapt an approach originally developed for merge sort using SIMD instructions. Given two sorted vectors, we can generate two sorted output vectors that are the result of a “merger” between the two input vectors, by using a sorting network whose branch-free implementation uses SIMD minimum and maximum instructions. We can compute such a merger with eight minimum instructions and eight maximum instructions, given two vectors made of eight 16-bit integers (see Fig. 6).
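A scalar stand-in for this merge step is sketched below (the name `merge_blocks` is ours): after the call, the $`K`$ smallest values sit in the first block and the $`K`$ largest in the second, which is exactly what the branch-free min/max sorting network produces on vectors.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the two-block merge step: merge two sorted k-element blocks so
 * that the k smallest values end up in b1 and the k largest in b2. The SIMD
 * version achieves this branch-free with eight min and eight max
 * instructions; this sketch assumes k <= 32. */
void merge_blocks(uint16_t *b1, uint16_t *b2, size_t k) {
    uint16_t tmp[64];
    size_t i = 0, j = 0, t = 0;
    while (i < k && j < k) tmp[t++] = (b1[i] <= b2[j]) ? b1[i++] : b2[j++];
    while (i < k) tmp[t++] = b1[i++];
    while (j < k) tmp[t++] = b2[j++];
    for (t = 0; t < k; t++) { b1[t] = tmp[t]; b2[t] = tmp[k + t]; }
}
```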
We can put this fast merge function to good use by initially loading one vector from each sorted array and merging them, so that the small values go into one vector ($`B_1`$) and the large values go into the other ($`B_2`$). Vector $`B_1`$ can be queued for output. We then advance a pointer in one of the two arrays, choosing the array with the smallest new value, and load a new vector. We merge this new vector with $`B_2`$. We repeat this process; see Algorithm [algo:genunionalgo]. We terminate with a scalar merge because the arrays’ lengths might not be a multiple of the vector size.
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
3. Output the elements from $`B_1`$, after removing the duplicates.
4. While both arrays have unprocessed elements:
   - Load a new block $`B_1`$ of $`K`$ elements from the array with the smallest new value.
   - Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
   - Output the elements from $`B_1`$, after removing the duplicates while taking into account the last value written to the output.
5. While the remaining array has unprocessed elements:
   - Load a new block $`B_1`$ of $`K`$ elements from the remaining array.
   - Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
   - Output the elements from $`B_1`$, after removing the duplicates while taking into account the last value written to the output.
6. Output the elements from $`B_2`$, after removing the duplicates.
The queued vector might include duplicate values. To “deduplicate” them we use a single vectorized comparison of each new vector with a version of itself offset back by one element (taking care to handle the final previously stored value). From the mask resulting from this comparison, we can quickly extract the integers that need to be written, using a look-up table and an appropriate permutation. See Fig. 7.
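A scalar sketch of this shifted-comparison deduplication (the name `dedup_block` is ours; `prev` plays the role of the final previously stored value):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: compare a sorted block with itself offset back by one element,
 * keeping each value that differs from its predecessor. `prev` carries the
 * last value written while processing the previous block. Returns the number
 * of values written to out. */
size_t dedup_block(const uint16_t *block, size_t k, uint16_t prev,
                   uint16_t *out) {
    size_t n = 0;
    for (size_t i = 0; i < k; i++) {
        uint16_t p = (i == 0) ? prev : block[i - 1];
        if (block[i] != p) out[n++] = block[i];
    }
    return n;
}
```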
Our vectorized approach is implemented to minimize mispredicted branches. We use 128-bit SIMD vectors instead of 256-bit AVX vectors. Larger vectors would further reduce the number of branches, but informal experiments indicate that the gains are modest. A downside of using larger vectors is that more of the processing occurs using scalar code, when the array lengths are not divisible by the vector lengths.
[Figure: the two-vector merge produces (vecMin, vecMax) containing the sixteen integer inputs in a sorted sequence, with the eight smallest integers in vecMin, and the eight largest in vecMax.]
Vectorized Differences Between Arrays
Unlike union and intersection, the difference operator is an asymmetric operation: $`A \setminus B`$ differs from $`B \setminus A`$. We remove the elements of the second array from the first array.
If we can compute the intersection quickly between blocks of data (e.g.,
with the _mm_cmpistrm intrinsic), we can also quickly obtain the
difference. In our vectorized intersection algorithm
(§ 4.2), we identify the elements that
are part of the intersection using a bitset. In a difference algorithm,
picking blocks of values from the first array, we can successively
remove the elements that are present in the second array, by iterating
through its data in blocks. The intersection, expressed as a bitset,
marks the elements for deletion, and we can conveniently accumulate
these indications by taking the bitwise OR of the bitsets. Indeed,
suppose that we want to compute $`A \setminus B`$ where
$`A= \{1,3,5,7\}`$ and $`B=\{1,2,3,4,6,7,8,9\}`$ using blocks of
4 integers:
- We compute the intersection between the first (and only) block from $`A`$ given by $`\{1,3,5,7\}`$ and the first block from $`B`$ given by $`\{1,2,3,4\}`$. We find that $`1,3`$ are part of the intersection, but not $`5,7`$. We can represent this result with the bitset 0b0011 (or 0x3).
- We advance to the next block in the second array, loading $`\{6,7,8,9\}`$. The intersection between the two blocks is $`\{1,3,5,7\}\cap\{6,7,8,9\}=\{7\}`$, which we can represent with the bitset 0b1000.
- We compute the bitwise OR between the two bitsets 0b0011 and 0b1000 to get 0b1011. It indicates that all but the third value in the block $`\{1,3,5,7\}`$ must be removed. So we output $`\{5\}`$.
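This block-wise difference can be sketched in scalar code (the name `block_difference` and the nested all-against-all comparison are our stand-ins for the pcmpistrm-based version; lengths are assumed to be multiples of the block size for brevity):

```c
#include <stdint.h>
#include <stddef.h>

enum { BLK = 4 }; /* block size; the SIMD version uses 8 x 16-bit values */

/* Sketch of the block-based difference A \ B: for each block of A,
 * accumulate a deletion mask by ORing the per-block intersection masks,
 * then write out the unmarked values. Arrays are sorted and duplicate-free;
 * lengths are multiples of BLK. */
size_t block_difference(const uint16_t *a, size_t na,
                        const uint16_t *b, size_t nb,
                        uint16_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < na; i += BLK) {
        unsigned mask = 0; /* bit k set => a[i + k] is marked for deletion */
        for (size_t j = 0; j < nb; j += BLK) {
            if (b[j] > a[i + BLK - 1]) break; /* later B-blocks are too large */
            for (size_t k = 0; k < BLK; k++)     /* all-against-all compare, */
                for (size_t l = 0; l < BLK; l++) /* as pcmpistrm would do    */
                    if (a[i + k] == b[j + l]) mask |= 1u << k;
        }
        for (size_t k = 0; k < BLK; k++)
            if (!(mask & (1u << k))) out[count++] = a[i + k];
    }
    return count;
}
```

Running it on the worked example $`A=\{1,3,5,7\}`$, $`B=\{1,2,3,4,6,7,8,9\}`$ accumulates the mask 0b1011 and emits the single value 5.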
Algorithm [algo:gendiffalgo] describes the difference algorithm at a high level.
As with our other algorithms, we need to finish the computation using scalar code because the inputs may not be divisible by the vector size. Moreover, if we only need to compute the size of the difference, we can use the same algorithm, but we simply do not store the result, only tracking the cardinality.
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Repeat:
   - Mark for deletion the elements in $`B_1\cap B_2`$.
   - If the maxima of the two blocks are equal: write $`B_1`$ to the output after removing the elements marked for deletion; load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted; load a new block $`B_2`$ of $`K`$ elements from the second array, or break the loop if exhausted.
   - Else, if the maximum of $`B_1`$ is smaller: write $`B_1`$ to the output after removing the elements marked for deletion; load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted.
   - Else: load a new block $`B_2`$ of $`K`$ elements from the second array, or break the loop if exhausted.
3. Write $`B_1`$, after removing the elements marked for deletion, as well as all unprocessed data from the first array, to the output.
Vectorized Symmetric Differences Between Arrays
The symmetric difference can be computed using a block-based algorithm similar to the union algorithm (Algorithm [algo:genunionalgo]). When computing the union, we remove duplicates, keeping just one instance of a value repeated twice. E.g., given $`\{1,1,2,2,3\}`$, we want to output $`\{1,2,3\}`$. When computing the symmetric difference, we need to entirely remove a value if it is repeated. Thus, given $`\{1,1,2,2,3\}`$, we want to get the set $`\{3\}`$. We can use a fast vectorized function for this purpose (see Fig. 8), and it is similar to the corresponding deduplication function used for the vectorized union (Fig. 7).
In the union, we can immediately write all values after removing
duplicates. With the symmetric difference, the largest value in the
current block should not be written immediately. Indeed, we need to
account for the fact that the next block might start with this largest
value, and that it may become necessary to omit it. Thus, we never write
the current largest value of the current block when computing the
symmetric difference. Instead, it is (potentially) added when the next
block is processed — note that val in
Fig. 8 is computed from v2 (which
has this value), rather than from n. Of course, when the algorithm
terminates, we must write the largest value encountered if it were not
repeated. In any case, we need to terminate the algorithm with scalar
code given that the size of our input lists may not be divisible by the
vector size.
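The cancellation rule itself can be sketched in scalar code (the name `keep_unique` is ours; the real algorithm applies this per block and withholds each block's largest value as described above):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the symmetric-difference deduplication rule: in a sorted
 * sequence where each value appears at most twice, keep only the values
 * that appear exactly once (duplicates cancel out). Returns the number of
 * values written to out. */
size_t keep_unique(const uint16_t *v, size_t n, uint16_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int dup = (i > 0 && v[i] == v[i - 1]) ||
                  (i + 1 < n && v[i] == v[i + 1]);
        if (!dup) out[count++] = v[i];
    }
    return count;
}
```

On the example from the text, $`\{1,1,2,2,3\}`$ reduces to $`\{3\}`$.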