Releases: ashvardanian/SimSIMD
Release v5.6.1
Release: v5.6.1 [skip ci]
Release v5.6.0
Release: v5.6.0 [skip ci]
Release v5.5.1
Release: v5.5.1 [skip ci]
Release v5.5.0
Release: v5.5.0 [skip ci]
Release v5.4.4
Release: v5.4.4 [skip ci]
Release v5.4.3
Release: v5.4.3 [skip ci]
Release v5.4.2
Release: v5.4.2 [skip ci]
Release v5.4.1
Release: v5.4.1 [skip ci]
v5.4: 100x – 10'000x More Accurate Cosine Distance
The cosine similarity is the most common and straightforward metric used in machine learning and information retrieval. Interestingly, there are multiple ways to shoot yourself in the foot when computing it. The cosine similarity is the cosine of the angle between two vectors, and the cosine distance is its complement: one minus the similarity.
In NumPy terms, the SimSIMD implementation is similar to:
import numpy as np

def cos_numpy(a: np.ndarray, b: np.ndarray) -> float:
    # Three dot products; SimSIMD computes them in a single fused pass
    ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)  # Fused in SimSIMD
    if a2 == 0 and b2 == 0: result = 0  # Same in SciPy
    elif ab == 0: result = 1  # Division by zero error in SciPy
    else: result = 1 - ab / (np.sqrt(a2) * np.sqrt(b2))  # Bigger rounding error in SciPy
    return result
In SciPy, however, the cosine distance is computed as `1 - ab / np.sqrt(a2 * b2)`. It handles the edge case of a zero and non-zero argument pair differently, resulting in a division by zero error. It's not only less efficient, but also less accurate, given how the reciprocal square roots are computed. The C standard library provides the `sqrt` function, which is generally very accurate, but slow. The `rsqrt` in-hardware implementations are faster, but have different accuracy characteristics.
- SSE `rsqrtps` and AVX `vrsqrtps`: $1.5 \times 2^{-12}$ maximal error.
- AVX-512 `vrsqrt14pd` instruction: $2^{-14}$ maximal error.
- NEON `frsqrte` instruction has no clear error bounds.
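The zero-vector edge case described above can be illustrated with a small sketch. It uses plain Python lists and the standard `math` module rather than NumPy or SciPy, and the function names `cos_guarded` and `cos_unguarded` are hypothetical, chosen only for this illustration. The unguarded variant mirrors the `1 - ab / sqrt(a2 * b2)` formula; in plain Python it raises `ZeroDivisionError`, whereas NumPy-based code would typically emit a warning and return `nan`.

```python
import math

def _dots(a, b):
    # The three dot products that the cosine distance is built from
    ab = sum(x * y for x, y in zip(a, b))
    a2 = sum(x * x for x in a)
    b2 = sum(y * y for y in b)
    return ab, a2, b2

def cos_guarded(a, b) -> float:
    # Hypothetical helper: same branch structure as the NumPy-style reference
    ab, a2, b2 = _dots(a, b)
    if a2 == 0 and b2 == 0:
        return 0.0  # two zero vectors are treated as identical
    if ab == 0:
        return 1.0  # covers a zero vs. non-zero pair and orthogonal vectors
    return 1.0 - ab / (math.sqrt(a2) * math.sqrt(b2))

def cos_unguarded(a, b) -> float:
    # Hypothetical helper: one square root of the product, no zero checks
    ab, a2, b2 = _dots(a, b)
    return 1.0 - ab / math.sqrt(a2 * b2)

if __name__ == "__main__":
    zero, other = [0.0, 0.0], [1.0, 2.0]
    print(cos_guarded(zero, other))
    try:
        cos_unguarded(zero, other)
    except ZeroDivisionError as error:
        print("unguarded formula failed:", error)
```

For non-degenerate inputs both variants agree up to rounding; they differ only in how a zero norm is handled.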
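To overcome the limitations of the `rsqrt` instruction, SimSIMD uses the Newton-Raphson iteration to refine the initial estimate for high-precision floating-point numbers. For an estimate $x_n \approx 1 / \sqrt{a}$, one iteration can be defined as:

$$x_{n+1} = x_n \cdot (3 - a \cdot x_n^2) / 2$$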
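As a rough illustration of why one step helps so much, the sketch below applies a single Newton-Raphson refinement, `y * (1.5 - 0.5 * a * y * y)`, to a deliberately perturbed reciprocal square root. The perturbation of $1.5 \times 2^{-12}$ emulates the worst-case error quoted for `rsqrtps`; it is an assumption made for this example, not a measurement, and `refine_rsqrt` is a hypothetical name rather than a SimSIMD function.

```python
import math

def refine_rsqrt(a: float, y: float) -> float:
    # One Newton-Raphson step for f(y) = 1 / y**2 - a, whose root is 1 / sqrt(a)
    return y * (1.5 - 0.5 * a * y * y)

a = 2.0
exact = 1.0 / math.sqrt(a)

# Emulate a hardware estimate with a relative error of 1.5 * 2**-12 (assumed)
estimate = exact * (1.0 + 1.5 * 2.0 ** -12)
refined = refine_rsqrt(a, estimate)

err_before = abs(estimate - exact) / exact
err_after = abs(refined - exact) / exact

if __name__ == "__main__":
    print(f"relative error before: {err_before:.3e}")
    print(f"relative error after:  {err_after:.3e}")
```

The error after one step scales with the square of the error before it, which is why a single iteration can be enough for `float32` inputs while `float64` may benefit from more.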
On 1536-dimensional inputs on an Intel Sapphire Rapids CPU, a single such iteration can reduce the relative error by 2-3 orders of magnitude:
Datatype | NumPy Error | SimSIMD w/out Iteration | SimSIMD |
---|---|---|---|
bfloat16 | 1.89e-08 ± 1.59e-08 | 3.07e-07 ± 3.09e-07 | 3.53e-09 ± 2.70e-09 |
float16 | 1.67e-02 ± 1.44e-02 | 2.68e-05 ± 1.95e-05 | 2.02e-05 ± 1.39e-05 |
float32 | 2.21e-08 ± 1.65e-08 | 3.47e-07 ± 3.49e-07 | 3.77e-09 ± 2.84e-09 |
float64 | 0.00e+00 ± 0.00e+00 | 3.80e-07 ± 4.50e-07 | 1.35e-11 ± 1.85e-11 |
On Arm:
Datatype | NumPy Error | SimSIMD w/out Iteration | SimSIMD |
---|---|---|---|
bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05 | 2.09e-08 ± 1.50e-08 |
float16 | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05 | 4.81e-05 ± 3.38e-05 |
float32 | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05 | 9.02e-09 ± 7.16e-09 |
float64 | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05 | 2.23e-13 ± 4.67e-13 |
Benchmarks
x86: Intel Sapphire Rapids
Baseline
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| | Metric | NDim | DType | Baseline Error | SimSIMD Error | Accurate Duration | Baseline Duration | SimSIMD Duration | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 9.86e-09 ± 1.58e-08 | 3.35e-04 ± 4.42e-04 | 2.16e+04 ± 1.18e+03 | 2.42e+04 ± 2.79e+03 | 2.51e+03 ± 4.17e+02 | 9.90x ± 2.24x |
| 1 | cosine | 11 | float16 | 1.46e-04 ± 1.83e-04 | 5.09e-04 ± 7.05e-04 | 2.16e+04 ± 1.27e+03 | 2.53e+04 ± 2.54e+03 | 1.15e+03 ± 9.31e+01 | 22.17x ± 1.76x |
| 2 | cosine | 11 | float32 | 2.13e-08 ± 2.20e-08 | 2.69e-04 ± 4.08e-04 | 2.14e+04 ± 1.51e+03 | 2.37e+04 ± 3.52e+03 | 1.96e+03 ± 6.73e+03 | 23.21x ± 3.90x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 4.51e-04 ± 5.78e-04 | 2.57e+04 ± 1.16e+04 | 1.57e+04 ± 1.55e+03 | 1.51e+03 ± 9.03e+02 | 11.57x ± 2.21x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 4.56e-04 ± 5.30e-04 | 1.59e+04 ± 6.32e+02 | 1.60e+04 ± 5.11e+02 | 1.72e+03 ± 6.12e+02 | 9.89x ± 1.86x |
| 5 | cosine | 97 | bfloat16 | 6.71e-09 ± 7.90e-09 | 1.31e-04 ± 1.47e-04 | 2.14e+04 ± 9.71e+02 | 2.36e+04 ± 4.33e+02 | 2.47e+03 ± 3.95e+02 | 9.82x ± 1.71x |
| 6 | cosine | 97 | float16 | 3.00e-05 ± 2.42e-05 | 1.00e-04 ± 7.79e-05 | 2.15e+04 ± 1.70e+03 | 2.70e+04 ± 2.02e+03 | 1.18e+03 ± 8.51e+01 | 22.89x ± 2.06x |
| 7 | cosine | 97 | float32 | 6.84e-09 ± 5.72e-09 | 1.13e-04 ± 1.19e-04 | 2.19e+04 ± 1.84e+03 | 2.33e+04 ± 1.91e+03 | 1.04e+03 ± 9.38e+01 | 22.44x ± 2.38x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 9.69e-05 ± 1.54e-04 | 2.13e+04 ± 2.00e+03 | 1.54e+04 ± 1.39e+03 | 1.30e+03 ± 1.20e+02 | 11.92x ± 1.47x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 1.14e-04 ± 1.33e-04 | 1.56e+04 ± 4.34e+02 | 1.60e+04 ± 3.64e+02 | 1.57e+03 ± 2.48e+02 | 10.43x ± 1.55x |
| 10 | cosine | 1536 | bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05 | 2.78e+04 ± 1.54e+03 | 2.73e+04 ± 4.66e+02 | 2.83e+03 ± 3.41e+02 | 9.82x ± 1.25x |
| 11 | cosine | 1536 | float16 | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05 | 2.56e+04 ± 2.02e+03 | 5.44e+04 ± 1.77e+03 | 1.48e+03 ± 1.78e+02 | 37.23x ± 4.42x |
| 12 | cosine | 1536 | float32 | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05 | 2.49e+04 ± 1.29e+03 | 2.63e+04 ± 5.41e+03 | 1.56e+03 ± 3.41e+02 | 17.46x ± 3.77x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05 | 2.51e+04 ± 2.21e+03 | 1.87e+04 ± 2.87e+02 | 2.39e+03 ± 6.24e+02 | 8.25x ± 1.68x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 3.06e-05 ± 3.12e-05 | 1.91e+04 ± 1.14e+03 | 2.18e+04 ± 1.17e+03 | 1.72e+03 ± 2.66e+02 | 13.00x ± 2.13x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
With 1 Iteration
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| | Metric | NDim | DType | Baseline Error | SimSIMD Error | Accurate Duration | Baseline Duration | SimSIMD Duration | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0 | cosine | 11 | bfloat16 | 3.04e-08 ± 2.53e-08 | 3.63e-09 ± 6.75e-09 | 1.24e+04 ± 8.90e+02 | 7.19e+03 ± 4.48e+02 | 2.75e+03 ± 7.66e+02 | 2.71x ± 0.37x |
| 1 | cosine | 11 | float16 | 2.61e-04 ± 2.45e-04 | 2.12e-04 ± 3.90e-04 | 1.24e+04 ± 9.59e+02 | 9.28e+03 ± 1.72e+03 | 1.27e+03 ± 5.28e+02 | 7.65x ± 1.19x |
| 2 | cosine | 11 | float32 | 2.91e-08 ± 1.81e-08 | 1.20e-08 ± 1.26e-08 | 1.36e+04 ± 4.00e+03 | 8.32e+03 ± 2.87e+03 | 1.09e+03 ± 1.63e+02 | 7.55x ± 1.54x |
| 3 | cosine | 11 | float64 | 0.00e+00 ± 0.00e+00 | 3.35e-10 ± 4.33e-10 | 1.35e+04 ± 7.24e+03 | 6.02e+03 ± 8.72e+02 | 1.58e+03 ± 1.18e+03 | 4.45x ± 1.08x |
| 4 | cosine | 11 | int8 | 0.00e+00 ± 0.00e+00 | 2.81e-03 ± 1.80e-02 | 9.17e+03 ± 4.17e+03 | 8.02e+03 ± 1.51e+03 | 1.76e+03 ± 2.00e+02 | 4.56x ± 0.81x |
| 5 | cosine | 97 | bfloat16 | 2.02e-08 ± 1.25e-08 | 3.44e-09 ± 4.63e-09 | 1.34e+04 ± 3.38e+03 | 7.79e+03 ± 2.84e+03 | 2.55e+03 ± 1.09e+02 | 3.05x ± 1.07x |
| 6 | cosine | 97 | float16 | 1.97e-04 ± 1.18e-04 | 5.37e-05 ± 4.52e-05 | 1.26e+04 ± 1.11e+03 | 1.06e+04 ± 2.95e+03 | 1.19e+03 ± 1.46e+02 | 8.90x ± 1.93x |
| 7 | cosine | 97 | float32 | 2.39e-08 ± 1.36e-08 | 5.66e-09 ± 4.83e-09 | 1.31e+04 ± 3.11e+03 | 7.78e+03 ± 1.25e+03 | 1.26e+03 ± 8.08e+02 | 6.92x ± 1.63x |
| 8 | cosine | 97 | float64 | 0.00e+00 ± 0.00e+00 | 6.84e-11 ± 1.10e-10 | 1.25e+04 ± 1.21e+03 | 6.51e+03 ± 1.69e+03 | 1.37e+03 ± 3.63e+02 | 4.89x ± 1.28x |
| 9 | cosine | 97 | int8 | 0.00e+00 ± 0.00e+00 | 1.93e-03 ± 4.20e-03 | 8.37e+03 ± 1.87e+03 | 7.89e+03 ± 7.80e+02 | 2.02e+03 ± 1.66e+03 | 4.34x ± 0.69x |
| 10 | cosine | 1536 | bfloat16 | 2.25e-08 ± 1.61e-08 | 3.53e-09 ± 2.70e-09 | 1.52e+04 ± 2.81e+03 | 8.28e+03 ± 5.26e+02 | 3.07e+03 ± 1.32e+02 | 2.70x ± 0.20x |
| 11 | cosine | 1536 | float16 | 2.00e-02 ± 1.76e-02 | 2.02e-05 ± 1.39e-05 | 1.43e+04 ± 2.37e+03 | 2.74e+04 ± 3.46e+03 | 1.38e+03 ± 1.25e+02 | 19.98x ± 2.35x |
| 12 | cosine | 1536 | float32 | 2.24e-08 ± 1.40e-08 | 3.77e-09 ± 2.84e-09 | 1.36e+04 ± 2.64e+03 | 8.64e+03 ± 8.04e+02 | 1.23e+03 ± 8.10e+01 | 7.06x ± 0.72x |
| 13 | cosine | 1536 | float64 | 0.00e+00 ± 0.00e+00 | 1.35e-11 ± 1.85e-11 | 1.34e+04 ± 1.27e+03 | 7.31e+03 ± 8.12e+02 | 1.98e+03 ± 2.02e+03 | 4.49x ± 1.01x |
| 14 | cosine | 1536 | int8 | 0.00e+00 ± 0.00e+00 | 4.20e-04 ± 4.88e-04 | 9.47e+03 ± 2.09e+03 | 1.01e+04 ± 1.11e+03 | 1.95e+03 ± 1.04e+02 | 5.19x ± 0.56x |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
Arm: AWS Graviton 3
Baseline
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| | Metric | NDim | DType | Baseline Error | SimSIMD Error | Accurate Duration | Baseline Duration | SimSIMD Duration | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-------...
v5.3: 5x Faster Set Intersections with SVE2, NEON, and AVX-512
- Need to accelerate TF-IDF search ranking?
- Joining large tables in an OLAP database?
- Implementing graph algorithms?
Chances are, you need fast set intersections! It's one of the most common operations in programming, yet one of the hardest to accelerate with SIMD! This PR improves existing kernels and adds new ones for fast set intersections of sorted arrays of unique `u16` and `u32` values. Now, SimSIMD is not only practically the only production codebase to use Arm SVE, but also one of the first to use the new SVE2 instructions available on Graviton 4 AWS CPUs, and coming to Nvidia's Grace Hopper, Microsoft Cobalt, and Google Axion! So upgrade to v5.3 and let's make the databases & search systems go 5x faster!
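For reference, the operation being accelerated can be sketched as a serial two-pointer merge over two sorted arrays of unique values. This is only a plain-Python illustration of the kind of scalar baseline that SIMD kernels compete against, not SimSIMD's implementation; the name `intersect_sorted` is hypothetical.

```python
from typing import List, Sequence

def intersect_sorted(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Intersect two ascending sequences of unique integers with two cursors."""
    i, j = 0, 0
    common: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        elif a[i] > b[j]:
            j += 1  # b[j] cannot appear later in a, skip it
        else:
            common.append(a[i])  # match: record it and advance both cursors
            i += 1
            j += 1
    return common

if __name__ == "__main__":
    print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 8, 9]))
```

Each loop iteration depends on a data-dependent comparison, which is one reason this operation is hard to vectorize.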
Speedups on x86
The new AVX-512 variant shows significant improvements in pairs/s across all benchmarks:
- For `|A|=128`, `|B|=128`, `|A∩B|=1`, pairs/s increased:
  - from 1.14M/s in the old implementation
  - to 7.73M/s in the new one, a 6.7x improvement.
- At `|A∩B|=64`, the pairs/s rose:
  - from 1.13M/s in the old implementation
  - to 8.19M/s in the new one, a 7.2x gain.
- For larger sets, like `|A|=1024` and `|B|=8192`, with `|A∩B|=10`, pairs/s increased:
  - from 130.18k/s in the old implementation
  - to 194.50k/s in the new one, a 49% gain.
However, in cases like `|A|=128`, `|B|=8192`, with `|A∩B|=64`, pairs/s decreased by about 40%, from 369.7k/s to 222.9k/s. Overall, the new implementation outperforms the previous one, and no case is worse than the serial version.
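The gains and the regression quoted in this section can be re-derived from the pairs/s figures. The sketch below uses the rounded numbers from the text, not the raw benchmark logs, and the `throughput` and `speedup` names are hypothetical.

```python
# (old pairs/s, new pairs/s) for the AVX-512 kernels, as quoted in the text
throughput = {
    "|A|=128,|B|=128,|A∩B|=1": (1.14e6, 7.73e6),
    "|A|=128,|B|=128,|A∩B|=64": (1.13e6, 8.19e6),
    "|A|=1024,|B|=8192,|A∩B|=10": (130.18e3, 194.50e3),
    "|A|=128,|B|=8192,|A∩B|=64": (369.7e3, 222.9e3),
}

def speedup(old: float, new: float) -> float:
    # Ratio of new to old throughput; values below 1.0 indicate a regression
    return new / old

ratios = {case: speedup(old, new) for case, (old, new) in throughput.items()}

if __name__ == "__main__":
    for case, ratio in ratios.items():
        print(f"{case}: {ratio:.2f}x")
```

A ratio below 1.0 marks the imbalanced `|A|=128`, `|B|=8192` case as the one regression among the quoted configurations.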
Speedups on Arm
On the Arm architecture, similar performance gains were achieved using the NEON and SVE2 instruction sets:
- The optimized NEON implementation showed a roughly 3.2x improvement in pairs/s for `|A|=128, |B|=128, |A∩B|=1`, going from 1.62M/s to 5.12M/s.
- For `|A∩B|=64` in the same configuration, performance improved from 1.60M/s to 5.51M/s, showing a 3.4x gain.
- The SVE2 implementation also outperformed the previous SVE setup, achieving 5.6M/s (NEON) versus 1.27M/s (SVE) for `|A|=128, |B|=128, |A∩B|=1`, a 4.4x improvement.
- In larger datasets, such as `|A|=1024` and `|B|=8192`, the pairs/s increased from 49.3k/s to 110.03k/s with NEON, and to 109.47k/s with SVE2, roughly doubling the performance.
x86 Benchmarking Setup
The benchmarking was conducted on `r7iz` AWS instances with Intel Sapphire Rapids CPUs.
Running build_release/simsimd_bench
Run on (16 X 3900.51 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 2048 KiB (x8)
L3 Unified 61440 KiB (x1)
Old Serial Baselines
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 567 ns 567 ns 24785678 pairs=1.76263M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 567 ns 567 ns 24598141 pairs=1.76286M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 569 ns 569 ns 24741572 pairs=1.75684M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 568 ns 568 ns 24871638 pairs=1.76073M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 2508 ns 2508 ns 5591748 pairs=398.803k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 2509 ns 2509 ns 5589871 pairs=398.535k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 2530 ns 2530 ns 5564535 pairs=395.33k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 2522 ns 2522 ns 5532306 pairs=396.447k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 4791 ns 4791 ns 2920833 pairs=208.737k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 4800 ns 4800 ns 2923139 pairs=208.346k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 4821 ns 4820 ns 2906942 pairs=207.448k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 4843 ns 4843 ns 2897334 pairs=206.504k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 4484 ns 4484 ns 3122873 pairs=223.023k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 4479 ns 4479 ns 3124662 pairs=223.261k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 4484 ns 4484 ns 3125584 pairs=223.034k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 4500 ns 4500 ns 3104588 pairs=222.229k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 20118 ns 20117 ns 696244 pairs=49.7084k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 20134 ns 20134 ns 696160 pairs=49.6682k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 20125 ns 20124 ns 695799 pairs=49.6911k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 20102 ns 20102 ns 695762 pairs=49.7464k/s
Existing AVX-512 Implementation
-------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 875 ns 875 ns 16248886 pairs=1.14342M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 873 ns 873 ns 16081249 pairs=1.14555M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 882 ns 882 ns 15851609 pairs=1.13354M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=121>/min_time:10.000/threads:1 916 ns 916 ns 15282595 pairs=1091.32k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1 955 ns 955 ns 14660187 pairs=1047.53k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1 955 ns 955 ns 14663375 pairs=1047.57k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1 952 ns 952 ns 14702462 pairs=1050.17k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1 949 ns 949 ns 14743103 pairs=1053.59k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1 2718 ns 2718 ns 5168053 pairs=367.871k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1 2698 ns 2698 ns 5155819 pairs=370.664k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1 2705 ns 2705 ns 5203675 pairs=369.686k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1 2693 ns 2693 ns 5187007 pairs=371.377k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1 7310 ns 7310 ns 1910292 pairs=136.8k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1 7312 ns 7312 ns 1913190 pairs=136.759k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1 7365 ns 7365 ns 1900946 pairs=135.781k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1 7439 ns 7439 ns 1882319 pairs=134.43k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1 7682 ns 7681 ns 1821784 pairs=130.183k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1 7695 ns 7695 ns 1821861 pairs=129.955k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1 7643 ns 7643 ns 1829955 pairs=130.842k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1 7617 ns 7617 ns 1838612 pairs=131.279k/s
New AVX-512 Implementation
--------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1 129 ns 129 ns 101989513 pairs=7.72559M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1 134 ns 134 ns 107140278 pairs=7.46949M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1 122 ns 122 ns 113134485 pairs=8.18634M/s
inter...