Roaring Bitmaps: Implementation of an Optimized Software Library
Modern commodity processors use parallelism to accelerate processing. SIMD instructions offer a particular form of processor parallelism that proves advantageous for processing large volumes of data. Whereas regular instructions operate on a single machine word (e.g., 64 bits), SIMD instructions operate on large registers (e.g., 256 bits) that can be used to represent “vectors” containing several distinct values. For example, a single SIMD instruction can add sixteen 16-bit integers in one 32 B vector register to the corresponding 16-bit integers in another vector register in a single cycle. Moreover, modern processors are capable of superscalar execution of certain vector instructions, allowing the execution of several SIMD instructions per cycle. For example, an Intel Skylake processor is capable of executing two vector additions every cycle.
SIMD instructions are ideally suited for operations between bitset containers. When computing the intersection, union, difference or symmetric difference between two (uncompressed) bitsets, we only have to load the data in registers, apply a logical operation between two words (AND, OR, AND NOT, or XOR) and, optionally, save the result to memory. All of these operations have corresponding SIMD instructions. So, instead of working over 64-bit words, we work over larger words (e.g., 256 bits), dividing the number of instructions (e.g., $`\div 4`$) and giving significantly faster processing.
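As a minimal illustration (a sketch, not the library's code; the function name `bitset_and` is ours), the word-at-a-time loop below computes the intersection of two bitsets. With AVX2, the loop body would operate on `__m256i` values via `_mm256_and_si256`; optimizing compilers frequently apply this transformation automatically.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: bitwise AND of two bitsets, one machine word at a time.
 * With AVX2 intrinsics, the loop body would use _mm256_and_si256 on
 * 256-bit registers; compilers commonly auto-vectorize this form. */
void bitset_and(const uint64_t *a, const uint64_t *b,
                uint64_t *out, size_t nwords) {
    for (size_t i = 0; i < nwords; i++)
        out[i] = a[i] & b[i];
}
```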
Historically, SIMD instructions have gotten wider and increasingly powerful. The Pentium 4 was limited to 128-bit instructions. Later x64 processors introduced increasingly powerful 128-bit instruction sets: SSSE3, SSE4, SSE4.1, SSE4.2. Today, we find rich support for 256-bit vector registers in the AVX2 instruction set available in recent x64 processors from Intel (starting with the Haswell microarchitecture, 2013) and AMD (starting with the Excavator microarchitecture, 2015). Upcoming Intel commodity processors will support 512-bit SIMD instructions (AVX-512). In many cases, the benefits of SIMD instructions increase as they get wider, which means that SIMD instructions tend to become more compelling with time.
Compilers can automatically translate C or C++ code into the appropriate SIMD instructions, even when the code does not take into account vectorization (scalar code). However, for greater gains, it is also possible to design algorithms and code that take into account the vector instructions. In these instances, we can use the SIMD intrinsics available in C and C++ to call SIMD instructions without having to use assembly code. See Table 1. These intrinsics are supported across several compilers (LLVM’s Clang, GNU GCC, Intel’s compiler, Visual Studio).
| instruction | C intrinsic | description | rec. throughput |
|---|---|---|---|
| vpand | _mm256_and_si256 | 256-bit AND | 0.33 |
| vpor | _mm256_or_si256 | 256-bit OR | 0.33 |
| vpxor | _mm256_xor_si256 | 256-bit XOR | 0.33 |
| vpsadbw | _mm256_sad_epu8 | sum of the absolute differences of the byte values to the low 16 bits of each 64-bit word | 1 |
| vpaddq | _mm256_add_epi64 | add 64-bit integers | 0.5 |
| vpshufb | _mm256_shuffle_epi8 | shuffle bytes within 128-bit lanes | 1 |
Some relevant AVX and AVX2 instructions with reciprocal throughput$`^{\mathrm{a}}`$ in CPU cycles on recent (Skylake) Intel processors.
The reciprocal throughput is the number of processor clocks it takes for an instruction to execute. A reciprocal throughput of 0.5 means that we execute two such instructions per clock.
Vectorized Population Count Over Bitsets
We can trivially vectorize operations between bitsets. Indeed, it suffices to compute bitwise operations over vectors instead of machine words. By aggressively unrolling the resulting loop, we can produce highly efficient code. Optimizing compilers can often automatically vectorize such code. It is more difficult, however, to also compute the cardinality of the result efficiently. Ideally, we would like to vectorize simultaneously the operations between bitsets and the computation of the cardinality of the result.
Vectorized Harley-Seal Population Count
Commodity processors have dedicated instructions to count the 1-bits in
a word (the “population count”): popcnt for x64 processors and cnt
for the 64-bit ARM architecture. On recent Intel processors, popcnt
has a throughput of one instruction per cycle for both 32-bit and 64-bit
registers.
Perhaps surprisingly, Muła et al. found that it is possible to do better. We can use a vectorized approach based on the circuit-based Harley-Seal algorithm. It is inspired by a carry-save adder (CSA) circuit: given 3 bit values ($`a,b,c\in \{0,1\}`$), we can sum them up into the two-bit value $`(a \text{ XOR } b) \text{ XOR } c`$ (least significant) and $`(a \text{ AND } b) \allowbreak \text{ OR } \allowbreak ( (a \text{ XOR } b) \text{ AND } c )`$ (most significant). That is, we can sum 3 individual bit values into a 2-bit counter using 5 logical operations, and we can generalize this approach to 256-bit vectors. Starting with 3 input vectors, we can generate two new output vectors, one containing the least significant bits and one containing the most significant (or carry) bits, with 5 bitwise logical operations (two XORs, two ANDs and one OR):
void CSA(__m256i *h, __m256i *l, __m256i a,
__m256i b, __m256i c) {
  // starts with inputs a, b, c; sets *h (carry bits) and *l (sum bits); u is a temporary
__m256i u = _mm256_xor_si256(a, b);
*h = _mm256_or_si256(_mm256_and_si256(a, b),
_mm256_and_si256(u, c));
*l = _mm256_xor_si256(u, c);
}
From such an adder function, we can derive an efficient population-count function, effectively composing 3-input adders into a circuit with 16 inputs, which encodes the population count of its 16 inputs as a 5-bit binary number. For instance, Knuth presents a construction requiring only 63 logical operations in the case of a 16-input circuit. Although using slightly more logical operations (75), the iterated Harley-Seal approach goes further: it increments the existing contents of a 5-bit binary number by the population count.
In a SIMD setting, our Harley-Seal circuit takes sixteen 256-bit vectors
and updates five 256-bit vectors serving as a bit-sliced accumulator:
in other words, if one counts the number of 1s amongst the
least-significant bits of the 16 inputs, the population count (0 to 16)
affects the least-significant bits of the 5 accumulator vectors.
Similarly, the population count of the second-least-significant bits of
the 16 inputs affects the second-least-significant bits of the
accumulator vectors. Equivalently, the 5 accumulator vectors can be
viewed as providing 256 different 5-bit accumulators. The accumulator
vectors are named ones, twos, fours, eights, and sixteens,
reflecting their use in the accumulator: if the least significant bits
of the accumulator vectors are all initially zero and the population
count of the least-significant bits of the 16 input vectors is 9, then
our Harley-Seal approach will result in the least-significant bits of
eights and ones being set, whereas the least-significant bits of
sixteens, fours and twos would be clear. (The accumulator went
from storing 0 to storing 9.) If the accumulator updated a second time
with the same inputs, then the least-significant bits of sixteens and
twos (only) would be set.
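To make the scheme concrete, here is a scalar sketch of the iterated Harley-Seal counter following Muła et al.'s published construction, with 64-bit words standing in for 256-bit vectors (the function names `csa`, `pop64` and `harley_seal` are ours). The AVX2 version replaces each word with a `__m256i` and each operator with the corresponding intrinsic.

```c
#include <stdint.h>
#include <stddef.h>

/* Carry-save adder on 64 bit-slices: returns the sum bits, stores the carry
 * bits in *h. */
static uint64_t csa(uint64_t *h, uint64_t a, uint64_t b, uint64_t c) {
    uint64_t u = a ^ b;
    *h = (a & b) | (u & c);
    return u ^ c;
}

/* Portable word popcount (Kernighan's loop); real code would use the popcnt
 * instruction or the vectorized lookup described in the text. */
static unsigned pop64(uint64_t x) {
    unsigned c = 0;
    for (; x; x &= x - 1) c++;
    return c;
}

/* Harley-Seal population count over an array whose size is a multiple of 16
 * words. The sixteens carry is flushed into `total` at the end of each block. */
uint64_t harley_seal(const uint64_t *d, size_t size) {
    uint64_t total = 0, ones = 0, twos = 0, fours = 0, eights = 0, sixteens = 0;
    uint64_t twosA, twosB, foursA, foursB, eightsA, eightsB;
    for (size_t i = 0; i + 16 <= size; i += 16) {
        ones = csa(&twosA, ones, d[i + 0], d[i + 1]);
        ones = csa(&twosB, ones, d[i + 2], d[i + 3]);
        twos = csa(&foursA, twos, twosA, twosB);
        ones = csa(&twosA, ones, d[i + 4], d[i + 5]);
        ones = csa(&twosB, ones, d[i + 6], d[i + 7]);
        twos = csa(&foursB, twos, twosA, twosB);
        fours = csa(&eightsA, fours, foursA, foursB);
        ones = csa(&twosA, ones, d[i + 8], d[i + 9]);
        ones = csa(&twosB, ones, d[i + 10], d[i + 11]);
        twos = csa(&foursA, twos, twosA, twosB);
        ones = csa(&twosA, ones, d[i + 12], d[i + 13]);
        ones = csa(&twosB, ones, d[i + 14], d[i + 15]);
        twos = csa(&foursB, twos, twosA, twosB);
        fours = csa(&eightsB, fours, foursA, foursB);
        eights = csa(&sixteens, eights, eightsA, eightsB);
        total += pop64(sixteens);
    }
    return 16 * total + 8 * pop64(eights) + 4 * pop64(fours)
         + 2 * pop64(twos) + pop64(ones);
}
```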
Fig. 3 illustrates a simplified case with only ones, twos and fours.
Fig. 4 shows the high-level
process: we load 16 vectors from the source bitset, we compute the
bitwise operations, and obtain the result in ones, twos, fours,
eights and sixteens. This processes a 4 kb block of an input bitset
using approximately 75 AVX operations, not considering any data-movement
instructions (see Fig. 2). We can compute the total value of
all 256 accumulators as
$`16\times\mathrm{count}(\texttt{sixteens})+8\times\mathrm{count}(\texttt{eights})+4\times\mathrm{count}(\texttt{fours})+2\times \mathrm{count}(\texttt{twos})+\mathrm{count}(\texttt{ones})`$
(13 more operations), where $`\mathrm{count}(b)`$ denotes the population
count of $`b`$. Iterated 16 times, we can process the entire input
bitset. Moreover, it is only necessary to read the total value of all
256 accumulators after the last iteration. However, we need to ensure
that, at the beginning of each iteration, none of our 5-bit accumulators
ever begins with a count above 15 — otherwise, an accumulator might need
to store a value exceeding 31. Therefore, at the end of each block we
count the 1-bits in each of the four 64-bit words of sixteens and add
the result to a vector counter $`c`$ before zeroing sixteens. To count
the 1-bits in sixteens quickly, we use vector registers as lookup
tables mapping 4-bit values (integers in $`[0,16)`$) to their
corresponding population counts, and the vpshufb instruction can effectively look
up 32 byte values at once. Thus we can divide each byte of sixteens
into its low nibble and high nibble and look up the population count of
each. To get a population count for the 64-bit words, we use the
instruction vpsadbw (with the intrinsic _mm256_sad_epu8): this
instruction adds the absolute values of the differences between byte
values within 64-bit subwords. We illustrate the code in
Fig. 1: the vectors lookuppos and
lookupneg store the positives and negatives of the population counts
with a canceling offset of 4 so that their subtraction is the sum of the
population counts. After the $`16^{\textrm{th}}`$ iteration, the
population count of the entire input bitset is computed as
$`16\times c+8\times\mathrm{count}(\texttt{eights})+4\times\mathrm{count}(\texttt{fours})+2\times \mathrm{count}(\texttt{twos})+\mathrm{count}(\texttt{ones})`$.
By contrast, in the approach using the dedicated population-count
instruction, the compiler generates one load, one popcnt and one add
per input 64-bit word. Current x64 processors decompose complex machine
instructions into low-level instructions called μops. The popcnt approach
generates three μops per word. For the SIMD approach, we process sixteen
256-bit vectors using 98 μops, including 16 loads, 32 bitwise ANDs (vpand),
15 bitwise ORs (vpor) and 30 bitwise XORs (vpxor)—or about 1.5 μops to
process each 64-bit word. That is, the vectorized approach generates
half the number of μops. We analyzed our implementations with the IACA code
analyser for our target x64 microarchitecture
(§ 5.2). Unsurprisingly, IACA predicts a
throughput of one word per cycle with the dedicated population-count
function, and a throughput of slightly more than two words per cycle
with the vectorized approach. Microbenchmarking agrees with IACA and
shows that our approach is roughly twice as fast as relying exclusively
on dedicated population-count instructions.
[Figure: carry-save aggregation of the inputs (A[i+0], …, A[i+15]) into ones and twos, producing ones, twos and fours.]

Vectorized Operations Over Bitsets With Population Counts
To aggregate two bitsets (e.g., with the intersection/AND, union/OR, etc.) while maintaining the cardinality, we found that an efficient approach is to load two machine words from each bitset, compute the bitwise operation, apply the population-count instruction on the two resulting words, and write them out before processing the next words. To benefit from vectorization, we could compute the bitwise logical operations using SIMD instructions, and then reload the result from memory and apply the population-count instructions. The interplay between SIMD instructions and a population-count function operating over 64-bit words would not be ideal.
Though an obvious application of the vectorized population count is the efficient computation of the cardinality of a bitset container, we also put it to good use whenever we need to compute the intersection, union, difference or symmetric difference between two bitset containers while, at the same time, gathering the cardinality of the result. We can also use it to compute the cardinality of the intersection, union, difference or symmetric difference without materializing them. As reported by Muła et al., in these cases, the benefits of the vectorized approach can exceed a factor of two. Indeed, if we use a vectorized approach to the population count, we can count the bits directly on the vector registers, without additional data transfer.
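A scalar sketch of this fused pattern follows (the names `popcount_word` and `bitset_or_card` are ours): the vectorized version keeps the 256-bit OR results in registers and feeds them directly into the Harley-Seal counters instead of calling a word-level popcount.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the popcnt instruction. */
static unsigned popcount_word(uint64_t x) {
    unsigned c = 0;
    for (; x; x &= x - 1) c++;
    return c;
}

/* Sketch: compute the union of two bitsets while gathering the cardinality
 * of the result in the same pass, with no extra load of the result. */
uint64_t bitset_or_card(const uint64_t *a, const uint64_t *b,
                        uint64_t *out, size_t nwords) {
    uint64_t card = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint64_t w = a[i] | b[i];
        out[i] = w;
        card += popcount_word(w);
    }
    return card;
}
```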
Vectorized Intersections Between Arrays
Because array containers represent integer values as sorted arrays of
16-bit integers, we can put to good use an algorithm based on a
vectorized string comparison function present in recent x64 processors
(SSE 4.2’s pcmpistrm). While intended for string comparisons, we can
repurpose the instruction to compute intersections. Each input array is
divided into blocks of eight 16-bit integers. There are two block
pointers (one for each array) initialized to the first blocks. We use
the pcmpistrm instruction to conduct an all-against-all comparison of
16-bit integers between these two blocks and we extract matching
integers from a mask produced by the instruction. The number of 1s in
the resulting mask indicates the number of matching integers. To avoid
the cost of comparing pairs of blocks that cannot match, a merge-like
process considers the maximum element within each block to find those
block-pairs to which pcmpistrm should be applied. That is, whenever
the current maximum value in the block of one array is smaller than or
equal to the maximum of the other block, we load the next block. In
particular, when the maxima of the two blocks are equal, we load a new
block in each array. See
Algorithm [algo:geninteralgo].
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Repeat:
   - Write the intersection between the two blocks $`B_1\cap B_2`$ to the output.
   - If the maxima of the two blocks are equal: load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted; load a new block $`B_2`$ of $`K`$ elements from the second array, or terminate if exhausted.
   - Else, if the maximum of $`B_1`$ is smaller: load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted.
   - Else: load a new block $`B_2`$ of $`K`$ elements from the second array, or terminate if exhausted.
We illustrate a simplified function in
Fig. 5. The _mm_cmpistrm intrinsic
compares all 16-bit integers from one 128-bit vector with all 16-bit
integers from another vector, returning a mask. From this mask, we
compute the number of matching values (using the _mm_popcnt_u32
intrinsic to call the popcnt instruction). Moreover, we can shuffle
the values in one of the vectors (vA) using a shuffling intrinsic
(_mm_shuffle_epi8) and a table of precomputed shuffling masks
(mask16). We could design an equivalent AVX2 function using 256-bit
vectors, but AVX2 lacks the equivalent of the _mm_cmpistrm intrinsic.
int32_t intersect(uint16_t *A, size_t lengthA,
uint16_t *B, size_t lengthB,
uint16_t *out) {
size_t count = 0; // size of intersection
size_t i = 0, j = 0;
int vectorlength = sizeof(__m128i) / sizeof(uint16_t);
__m128i vA, vB, rv, sm16;
  while ((i < lengthA) && (j < lengthB)) {
vA = _mm_loadu_si128((__m128i *)&A[i]);
vB = _mm_loadu_si128((__m128i *)&B[j]);
rv = _mm_cmpistrm(vB, vA,
_SIDD_UWORD_OPS | _SIDD_CMP_EQUAL_ANY | _SIDD_BIT_MASK);
    int r = _mm_extract_epi32(rv, 0);
    // mask16 is a precomputed table of shuffle masks, indexed by the bit mask r
    sm16 = _mm_load_si128(mask16 + r);
__m128i p = _mm_shuffle_epi8(vA, sm16);
_mm_storeu_si128((__m128i *)(out + count),p);
count += _mm_popcnt_u32(r);
uint16_t a_max = A[i + vectorlength - 1];
uint16_t b_max = B[j + vectorlength - 1];
if (a_max <= b_max) i += vectorlength;
if (b_max <= a_max) j += vectorlength;
}
return count;
}
Special-case processing is required when our arrays do not have lengths
divisible by eight: we need to finish the algorithm with a small scalar
intersection algorithm. Special processing is also required if an array
starts with zero, because the pcmpistrm instruction is designed for
string values and it processes null values as special (string ending)
values.
Evidently, we can easily modify this function if we only wish to get the cardinality. For the most part, we merely skip the memory store instructions.
When the pcmpistrm instruction is available, as it is on all recent
x64 processors, it is advantageous to replace a conventional list
intersection function by such a vectorized approach. However, we still
rely on a galloping intersection when one of the two input arrays is
much smaller. For example, the vectorized approach is obviously not
helpful to intersect an array made of 1000 values with one made of
two values. There are vectorized versions of the galloping intersection
which could further improve our performance in these cases, but we leave
such an optimization for future work.
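For reference, the galloping (exponential-search) intersection we fall back on can be sketched as follows; this is a generic illustration under our own naming (`gallop_intersect`), not the library's exact routine. For each value of the small array, we gallop through the large array with doubling steps, then binary-search the bracketed range.

```c
#include <stdint.h>
#include <stddef.h>

/* Galloping intersection sketch: assumes both arrays are sorted and
 * duplicate-free. Runs in O(n_small * log(n_large)) worst case, and much
 * faster when matches are clustered near the front. */
size_t gallop_intersect(const uint16_t *small, size_t n_small,
                        const uint16_t *large, size_t n_large,
                        uint16_t *out) {
    size_t count = 0, lo = 0;
    for (size_t i = 0; i < n_small; i++) {
        uint16_t key = small[i];
        size_t step = 1, hi = lo;
        while (hi < n_large && large[hi] < key) { // gallop forward
            lo = hi;
            hi += step;
            step *= 2;
        }
        if (hi > n_large) hi = n_large;
        while (lo < hi) { // binary search in [lo, hi)
            size_t mid = lo + (hi - lo) / 2;
            if (large[mid] < key) lo = mid + 1; else hi = mid;
        }
        if (lo < n_large && large[lo] == key) out[count++] = key;
    }
    return count;
}
```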
Vectorized Unions Between Arrays
To compute the union between two array containers, we adapt an approach originally developed for merge sort using SIMD instructions. Given two sorted vectors, we can generate two sorted output vectors that are the result of a “merger” between the two input vectors, by using a sorting network whose branch-free implementation uses SIMD minimum and maximum instructions. We can compute such a merger with eight minimum instructions and eight maximum instructions, given two vectors made of eight 16-bit integers (see Fig. 6).
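A scalar stand-in for this merge step is sketched below (the name `merge_blocks` is ours): after the call, the $`K`$ smallest values sit in the first block and the $`K`$ largest in the second, which is exactly what the branch-free min/max sorting network produces on vectors.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the two-block merge step: merge two sorted k-element blocks so
 * that the k smallest values end up in b1 and the k largest in b2. The SIMD
 * version achieves this branch-free with eight min and eight max
 * instructions; this sketch assumes k <= 32. */
void merge_blocks(uint16_t *b1, uint16_t *b2, size_t k) {
    uint16_t tmp[64];
    size_t i = 0, j = 0, t = 0;
    while (i < k && j < k) tmp[t++] = (b1[i] <= b2[j]) ? b1[i++] : b2[j++];
    while (i < k) tmp[t++] = b1[i++];
    while (j < k) tmp[t++] = b2[j++];
    for (t = 0; t < k; t++) { b1[t] = tmp[t]; b2[t] = tmp[k + t]; }
}
```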
We can put this fast merge function to good use by initially loading one vector from each sorted array and merging them, so that the small values go into one vector ($`B_1`$) and the large values go into the other ($`B_2`$). Vector $`B_1`$ can be queued for output. We then advance a pointer in one of the two arrays, choosing the array with the smallest new value, and load a new vector. We merge this new vector with $`B_2`$. We repeat this process; see Algorithm [algo:genunionalgo]. We terminate with a scalar merge because the arrays’ lengths might not be a multiple of the vector size.
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
3. Output the elements from $`B_1`$, after removing the duplicates.
4. While both arrays have unprocessed elements:
   - Load a new block $`B_1`$ of $`K`$ elements from the array with the smallest new value.
   - Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
   - Output the elements from $`B_1`$, after removing the duplicates while taking into account the last value written to the output.
5. While the remaining array has unprocessed elements:
   - Load a new block $`B_1`$ of $`K`$ elements from the remaining array.
   - Merge blocks $`B_1`$ and $`B_2`$ so that the smallest elements are in block $`B_1`$ and the largest are in block $`B_2`$.
   - Output the elements from $`B_1`$, after removing the duplicates while taking into account the last value written to the output.
6. Output the elements from $`B_2`$, after removing the duplicates.
The queued vector might include duplicate values. To “deduplicate” them we use a single vectorized comparison of each new vector with a version of itself offset back by one element (taking care to handle the final previously stored value). From the mask resulting from this comparison, we can quickly extract the integers that need to be written, using a look-up table and an appropriate permutation. See Fig. 7.
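A scalar sketch of this shifted-comparison deduplication (the name `dedup_block` is ours; `prev` plays the role of the final previously stored value):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: compare a sorted block with itself offset back by one element,
 * keeping each value that differs from its predecessor. `prev` carries the
 * last value written while processing the previous block. Returns the number
 * of values written to out. */
size_t dedup_block(const uint16_t *block, size_t k, uint16_t prev,
                   uint16_t *out) {
    size_t n = 0;
    for (size_t i = 0; i < k; i++) {
        uint16_t p = (i == 0) ? prev : block[i - 1];
        if (block[i] != p) out[n++] = block[i];
    }
    return n;
}
```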
Our vectorized approach is implemented to minimize mispredicted branches. We use 128-bit SIMD vectors instead of 256-bit AVX vectors. Larger vectors would further reduce the number of branches, but informal experiments indicate that the gains are modest. A downside of using larger vectors is that more of the processing occurs using scalar code, when the array lengths are not divisible by the vector lengths.
[Figure: the two-vector merge produces (vecMin, vecMax) containing the sixteen integer inputs in a sorted sequence, with the eight smallest integers in vecMin, and the eight largest in vecMax.]
Vectorized Differences Between Arrays
Unlike union and intersection, the difference operator is an asymmetric operation: $`A \setminus B`$ differs from $`B \setminus A`$. We remove the elements of the second array from the first array.
If we can compute the intersection quickly between blocks of data (e.g.,
with the _mm_cmpistrm intrinsic), we can also quickly obtain the
difference. In our vectorized intersection algorithm
(§ 4.2), we identify the elements that
are part of the intersection using a bitset. In a difference algorithm,
picking blocks of values from the first array, we can successively
remove the elements that are present in the second array, by iterating
through its data in blocks. The intersection, expressed as a bitset,
marks the elements for deletion, and we can conveniently accumulate
these indications by taking the bitwise OR of the bitsets. Indeed,
suppose that we want to compute $`A \setminus B`$ where
$`A= \{1,3,5,7\}`$ and $`B=\{1,2,3,4,6,7,8,9\}`$ using blocks of
4 integers:
- We compute the intersection between the first (and only) block from $`A`$ given by $`\{1,3,5,7\}`$ and the first block from $`B`$ given by $`\{1,2,3,4\}`$. We find that $`1,3`$ are part of the intersection, but not $`5,7`$. We can represent this result with the bitset 0b0011 (or 0x3).
- We advance to the next block in the second array, loading $`\{6,7,8,9\}`$. The intersection between the two blocks is $`\{1,3,5,7\}\cap\{6,7,8,9\}=\{7\}`$, which we can represent with the bitset 0b1000.
- We compute the bitwise OR between the two bitsets 0b0011 and 0b1000 to get 0b1011. It indicates that all but the third value in the block $`\{1,3,5,7\}`$ must be removed. So we output $`\{5\}`$.
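This block-wise difference can be sketched in scalar code (the name `block_difference` and the nested all-against-all comparison are our stand-ins for the pcmpistrm-based version; lengths are assumed to be multiples of the block size for brevity):

```c
#include <stdint.h>
#include <stddef.h>

enum { BLK = 4 }; /* block size; the SIMD version uses 8 x 16-bit values */

/* Sketch of the block-based difference A \ B: for each block of A,
 * accumulate a deletion mask by ORing the per-block intersection masks,
 * then write out the unmarked values. Arrays are sorted and duplicate-free;
 * lengths are multiples of BLK. */
size_t block_difference(const uint16_t *a, size_t na,
                        const uint16_t *b, size_t nb,
                        uint16_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < na; i += BLK) {
        unsigned mask = 0; /* bit k set => a[i + k] is marked for deletion */
        for (size_t j = 0; j < nb; j += BLK) {
            if (b[j] > a[i + BLK - 1]) break; /* later B-blocks are too large */
            for (size_t k = 0; k < BLK; k++)     /* all-against-all compare, */
                for (size_t l = 0; l < BLK; l++) /* as pcmpistrm would do    */
                    if (a[i + k] == b[j + l]) mask |= 1u << k;
        }
        for (size_t k = 0; k < BLK; k++)
            if (!(mask & (1u << k))) out[count++] = a[i + k];
    }
    return count;
}
```

Running it on the worked example $`A=\{1,3,5,7\}`$, $`B=\{1,2,3,4,6,7,8,9\}`$ accumulates the mask 0b1011 and emits the single value 5.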
Algorithm [algo:gendiffalgo] describes the difference algorithm at a high level.
As with our other algorithms, we need to finish the computation using scalar code because the inputs may not be divisible by the vector size. Moreover, if we only need to compute the size of the difference, we can use the same algorithm, but we simply do not store the result, only tracking the cardinality.
Require: two non-empty arrays made of a multiple of $`K`$ distinct values in sorted order
1. Load a block $`B_1`$ of $`K`$ elements from the first array, and a block $`B_2`$ of $`K`$ elements from the second array.
2. Repeat:
   - Mark for deletion the elements in $`B_1\cap B_2`$.
   - If the maxima of the two blocks are equal: write $`B_1`$ to the output after removing the elements marked for deletion; load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted; load a new block $`B_2`$ of $`K`$ elements from the second array, or break the loop if exhausted.
   - Else, if the maximum of $`B_1`$ is smaller: write $`B_1`$ to the output after removing the elements marked for deletion; load a new block $`B_1`$ of $`K`$ elements from the first array, or terminate if exhausted.
   - Else: load a new block $`B_2`$ of $`K`$ elements from the second array, or break the loop if exhausted.
3. Write $`B_1`$, after removing the elements marked for deletion, as well as all unprocessed data from the first array, to the output.
Vectorized Symmetric Differences Between Arrays
The symmetric difference can be computed using a block-based algorithm similar to the union algorithm (Algorithm [algo:genunionalgo]). When computing the union, we remove duplicates, keeping just one instance of a value repeated twice. E.g., given $`\{1,1,2,2,3\}`$, we want to output $`\{1,2,3\}`$. When computing the symmetric difference, we need to entirely remove a value if it is repeated. Thus, given $`\{1,1,2,2,3\}`$, we want to get the set $`\{3\}`$. We can use a fast vectorized function for this purpose (see Fig. 8), and it is similar to the corresponding deduplication function used for the vectorized union (Fig. 7).
In the union, we can immediately write all values after removing
duplicates. With the symmetric difference, the largest value in the
current block should not be written immediately. Indeed, we need to
account for the fact that the next block might start with this largest
value, and that it may become necessary to omit it. Thus, we never write
the current largest value of the current block when computing the
symmetric difference. Instead, it is (potentially) added when the next
block is processed — note that val in
Fig. 8 is computed from v2 (which
has this value), rather than from n. Of course, when the algorithm
terminates, we must write the largest value encountered if it were not
repeated. In any case, we need to terminate the algorithm with scalar
code given that the size of our input lists may not be divisible by the
vector size.
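The cancellation rule itself can be sketched in scalar code (the name `keep_unique` is ours; the real algorithm applies this per block and withholds each block's largest value as described above):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the symmetric-difference deduplication rule: in a sorted
 * sequence where each value appears at most twice, keep only the values
 * that appear exactly once (duplicates cancel out). Returns the number of
 * values written to out. */
size_t keep_unique(const uint16_t *v, size_t n, uint16_t *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int dup = (i > 0 && v[i] == v[i - 1]) ||
                  (i + 1 < n && v[i] == v[i + 1]);
        if (!dup) out[count++] = v[i];
    }
    return count;
}
```

On the example from the text, $`\{1,1,2,2,3\}`$ reduces to $`\{3\}`$.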