Releases: ashvardanian/SimSIMD

Release v5.6.1

09 Oct 15:01

Release: v5.6.1 [skip ci]

Release v5.6.0

06 Oct 22:06

Release: v5.6.0 [skip ci]

Release v5.5.1

06 Oct 01:43

Release: v5.5.1 [skip ci]

Release v5.5.0

04 Oct 20:39

Release: v5.5.0 [skip ci]

Release v5.4.4

03 Oct 01:29

Release: v5.4.4 [skip ci]

Release v5.4.3

28 Sep 04:19

Release: v5.4.3 [skip ci]

Release v5.4.2

21 Sep 00:27

Release: v5.4.2 [skip ci]

Release v5.4.1

19 Sep 02:54

Release: v5.4.1 [skip ci]

v5.4: 100x – 10'000x More Accurate Cosine Distance

18 Sep 01:45

The cosine similarity is the most common and straightforward metric used in machine learning and information retrieval. Interestingly, there are multiple ways to shoot yourself in the foot when computing it. The cosine similarity is the cosine of the angle between two vectors, and the cosine distance is its complement:

$$\text{CosineSimilarity}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}$$ $$\text{CosineDistance}(a, b) = 1 - \frac{a \cdot b}{\|a\| \cdot \|b\|}$$

In NumPy terms, the SimSIMD implementation is similar to:

import numpy as np

def cos_numpy(a: np.ndarray, b: np.ndarray) -> float:
    ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b) # Fused in SimSIMD
    if a2 == 0 and b2 == 0: result = 0.0                  # Same in SciPy
    elif ab == 0: result = 1.0                            # Division-by-zero error in SciPy
    else: result = 1 - ab / (np.sqrt(a2) * np.sqrt(b2))   # Bigger rounding error in SciPy
    return result
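
As a quick sanity check of those branches, a zero vector makes an unguarded formulation produce NaN, while the guarded version above stays well-defined. This is a minimal sketch in plain NumPy, not the actual SimSIMD kernel:

```python
import numpy as np

a = np.zeros(3)                 # a zero vector
b = np.array([1.0, 2.0, 3.0])   # a non-zero vector
ab, a2, b2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)

# Unguarded: 0/0 yields NaN (warnings silenced for the demo)
with np.errstate(divide="ignore", invalid="ignore"):
    naive = 1 - ab / np.sqrt(a2 * b2)

# Guarded: the ab == 0 branch short-circuits before any division
guarded = 1.0 if ab == 0 else 1 - ab / (np.sqrt(a2) * np.sqrt(b2))

print(np.isnan(naive))  # True
print(guarded)          # 1.0
```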

In SciPy, however, the cosine distance is computed as 1 - ab / np.sqrt(a2 * b2). It handles the edge case of a zero and non-zero argument pair differently, resulting in a division-by-zero error. It's not only less efficient, but also less accurate, given how the reciprocal square roots are computed. The C standard library provides the sqrt function, which is generally very accurate but slow. The in-hardware rsqrt implementations are faster, but have different accuracy characteristics:

  • SSE rsqrtps and AVX vrsqrtps: $1.5 \times 2^{-12}$ maximal error.
  • AVX-512 vrsqrt14pd instruction: $2^{-14}$ maximal error.
  • NEON frsqrte instruction has no clear error bounds.

To overcome the limitations of the rsqrt instruction, SimSIMD uses the Newton-Raphson iteration to refine the initial estimate in high-precision floating-point arithmetic. For an input $d$ and an estimate $x_n \approx 1/\sqrt{d}$, it can be defined as:

$$x_{n+1} = x_n \cdot (3 - d \cdot x_n \cdot x_n) / 2$$
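
A minimal sketch of that refinement step, using a deliberately crude initial estimate in place of a hardware rsqrt result:

```python
import numpy as np

def refine_rsqrt(d: float, x: float) -> float:
    # One Newton-Raphson step for f(x) = 1/x^2 - d,
    # refining an estimate x of 1/sqrt(d)
    return x * (3.0 - d * x * x) / 2.0

d = 2.0
exact = 1.0 / np.sqrt(d)  # 0.7071067811865475...
x0 = 0.7                  # crude initial estimate
x1 = refine_rsqrt(d, x0)

print(abs(x0 - exact))  # ~7.1e-03
print(abs(x1 - exact))  # ~1.1e-04 — the error shrinks roughly quadratically
```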

On 1536-dimensional inputs on an Intel Sapphire Rapids CPU, a single such iteration can reduce the relative error by 2-3 orders of magnitude:

| Datatype | NumPy Error         | SimSIMD w/out Iteration | SimSIMD             |
|:---------|:--------------------|:------------------------|:--------------------|
| bfloat16 | 1.89e-08 ± 1.59e-08 | 3.07e-07 ± 3.09e-07     | 3.53e-09 ± 2.70e-09 |
| float16  | 1.67e-02 ± 1.44e-02 | 2.68e-05 ± 1.95e-05     | 2.02e-05 ± 1.39e-05 |
| float32  | 2.21e-08 ± 1.65e-08 | 3.47e-07 ± 3.49e-07     | 3.77e-09 ± 2.84e-09 |
| float64  | 0.00e+00 ± 0.00e+00 | 3.80e-07 ± 4.50e-07     | 1.35e-11 ± 1.85e-11 |

On Arm:

| Datatype | NumPy Error         | SimSIMD w/out Iteration | SimSIMD             |
|:---------|:--------------------|:------------------------|:--------------------|
| bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05     | 2.09e-08 ± 1.50e-08 |
| float16  | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05     | 4.81e-05 ± 3.38e-05 |
| float32  | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05     | 9.02e-09 ± 7.16e-09 |
| float64  | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05     | 2.23e-13 ± 4.67e-13 |

Benchmarks

x86: Intel Sapphire Rapids

Baseline

+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim |  DType   |   Baseline Error    |    SimSIMD Error    |  Accurate Duration  |  Baseline Duration  |  SimSIMD Duration   | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0  | cosine |  11  | bfloat16 | 9.86e-09 ± 1.58e-08 | 3.35e-04 ± 4.42e-04 | 2.16e+04 ± 1.18e+03 | 2.42e+04 ± 2.79e+03 | 2.51e+03 ± 4.17e+02 |  9.90x ± 2.24x  |
| 1  | cosine |  11  | float16  | 1.46e-04 ± 1.83e-04 | 5.09e-04 ± 7.05e-04 | 2.16e+04 ± 1.27e+03 | 2.53e+04 ± 2.54e+03 | 1.15e+03 ± 9.31e+01 | 22.17x ± 1.76x  |
| 2  | cosine |  11  | float32  | 2.13e-08 ± 2.20e-08 | 2.69e-04 ± 4.08e-04 | 2.14e+04 ± 1.51e+03 | 2.37e+04 ± 3.52e+03 | 1.96e+03 ± 6.73e+03 | 23.21x ± 3.90x  |
| 3  | cosine |  11  | float64  | 0.00e+00 ± 0.00e+00 | 4.51e-04 ± 5.78e-04 | 2.57e+04 ± 1.16e+04 | 1.57e+04 ± 1.55e+03 | 1.51e+03 ± 9.03e+02 | 11.57x ± 2.21x  |
| 4  | cosine |  11  |   int8   | 0.00e+00 ± 0.00e+00 | 4.56e-04 ± 5.30e-04 | 1.59e+04 ± 6.32e+02 | 1.60e+04 ± 5.11e+02 | 1.72e+03 ± 6.12e+02 |  9.89x ± 1.86x  |
| 5  | cosine |  97  | bfloat16 | 6.71e-09 ± 7.90e-09 | 1.31e-04 ± 1.47e-04 | 2.14e+04 ± 9.71e+02 | 2.36e+04 ± 4.33e+02 | 2.47e+03 ± 3.95e+02 |  9.82x ± 1.71x  |
| 6  | cosine |  97  | float16  | 3.00e-05 ± 2.42e-05 | 1.00e-04 ± 7.79e-05 | 2.15e+04 ± 1.70e+03 | 2.70e+04 ± 2.02e+03 | 1.18e+03 ± 8.51e+01 | 22.89x ± 2.06x  |
| 7  | cosine |  97  | float32  | 6.84e-09 ± 5.72e-09 | 1.13e-04 ± 1.19e-04 | 2.19e+04 ± 1.84e+03 | 2.33e+04 ± 1.91e+03 | 1.04e+03 ± 9.38e+01 | 22.44x ± 2.38x  |
| 8  | cosine |  97  | float64  | 0.00e+00 ± 0.00e+00 | 9.69e-05 ± 1.54e-04 | 2.13e+04 ± 2.00e+03 | 1.54e+04 ± 1.39e+03 | 1.30e+03 ± 1.20e+02 | 11.92x ± 1.47x  |
| 9  | cosine |  97  |   int8   | 0.00e+00 ± 0.00e+00 | 1.14e-04 ± 1.33e-04 | 1.56e+04 ± 4.34e+02 | 1.60e+04 ± 3.64e+02 | 1.57e+03 ± 2.48e+02 | 10.43x ± 1.55x  |
| 10 | cosine | 1536 | bfloat16 | 1.55e-09 ± 1.27e-09 | 2.79e-05 ± 3.60e-05 | 2.78e+04 ± 1.54e+03 | 2.73e+04 ± 4.66e+02 | 2.83e+03 ± 3.41e+02 |  9.82x ± 1.25x  |
| 11 | cosine | 1536 | float16  | 1.05e-05 ± 9.99e-06 | 4.97e-05 ± 4.33e-05 | 2.56e+04 ± 2.02e+03 | 5.44e+04 ± 1.77e+03 | 1.48e+03 ± 1.78e+02 | 37.23x ± 4.42x  |
| 12 | cosine | 1536 | float32  | 2.37e-09 ± 1.88e-09 | 1.79e-05 ± 1.69e-05 | 2.49e+04 ± 1.29e+03 | 2.63e+04 ± 5.41e+03 | 1.56e+03 ± 3.41e+02 | 17.46x ± 3.77x  |
| 13 | cosine | 1536 | float64  | 0.00e+00 ± 0.00e+00 | 2.54e-05 ± 2.32e-05 | 2.51e+04 ± 2.21e+03 | 1.87e+04 ± 2.87e+02 | 2.39e+03 ± 6.24e+02 |  8.25x ± 1.68x  |
| 14 | cosine | 1536 |   int8   | 0.00e+00 ± 0.00e+00 | 3.06e-05 ± 3.12e-05 | 1.91e+04 ± 1.14e+03 | 2.18e+04 ± 1.17e+03 | 1.72e+03 ± 2.66e+02 | 13.00x ± 2.13x  |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+

With 1 Iteration

+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim |  DType   |   Baseline Error    |    SimSIMD Error    |  Accurate Duration  |  Baseline Duration  |  SimSIMD Duration   | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
| 0  | cosine |  11  | bfloat16 | 3.04e-08 ± 2.53e-08 | 3.63e-09 ± 6.75e-09 | 1.24e+04 ± 8.90e+02 | 7.19e+03 ± 4.48e+02 | 2.75e+03 ± 7.66e+02 |  2.71x ± 0.37x  |
| 1  | cosine |  11  | float16  | 2.61e-04 ± 2.45e-04 | 2.12e-04 ± 3.90e-04 | 1.24e+04 ± 9.59e+02 | 9.28e+03 ± 1.72e+03 | 1.27e+03 ± 5.28e+02 |  7.65x ± 1.19x  |
| 2  | cosine |  11  | float32  | 2.91e-08 ± 1.81e-08 | 1.20e-08 ± 1.26e-08 | 1.36e+04 ± 4.00e+03 | 8.32e+03 ± 2.87e+03 | 1.09e+03 ± 1.63e+02 |  7.55x ± 1.54x  |
| 3  | cosine |  11  | float64  | 0.00e+00 ± 0.00e+00 | 3.35e-10 ± 4.33e-10 | 1.35e+04 ± 7.24e+03 | 6.02e+03 ± 8.72e+02 | 1.58e+03 ± 1.18e+03 |  4.45x ± 1.08x  |
| 4  | cosine |  11  |   int8   | 0.00e+00 ± 0.00e+00 | 2.81e-03 ± 1.80e-02 | 9.17e+03 ± 4.17e+03 | 8.02e+03 ± 1.51e+03 | 1.76e+03 ± 2.00e+02 |  4.56x ± 0.81x  |
| 5  | cosine |  97  | bfloat16 | 2.02e-08 ± 1.25e-08 | 3.44e-09 ± 4.63e-09 | 1.34e+04 ± 3.38e+03 | 7.79e+03 ± 2.84e+03 | 2.55e+03 ± 1.09e+02 |  3.05x ± 1.07x  |
| 6  | cosine |  97  | float16  | 1.97e-04 ± 1.18e-04 | 5.37e-05 ± 4.52e-05 | 1.26e+04 ± 1.11e+03 | 1.06e+04 ± 2.95e+03 | 1.19e+03 ± 1.46e+02 |  8.90x ± 1.93x  |
| 7  | cosine |  97  | float32  | 2.39e-08 ± 1.36e-08 | 5.66e-09 ± 4.83e-09 | 1.31e+04 ± 3.11e+03 | 7.78e+03 ± 1.25e+03 | 1.26e+03 ± 8.08e+02 |  6.92x ± 1.63x  |
| 8  | cosine |  97  | float64  | 0.00e+00 ± 0.00e+00 | 6.84e-11 ± 1.10e-10 | 1.25e+04 ± 1.21e+03 | 6.51e+03 ± 1.69e+03 | 1.37e+03 ± 3.63e+02 |  4.89x ± 1.28x  |
| 9  | cosine |  97  |   int8   | 0.00e+00 ± 0.00e+00 | 1.93e-03 ± 4.20e-03 | 8.37e+03 ± 1.87e+03 | 7.89e+03 ± 7.80e+02 | 2.02e+03 ± 1.66e+03 |  4.34x ± 0.69x  |
| 10 | cosine | 1536 | bfloat16 | 2.25e-08 ± 1.61e-08 | 3.53e-09 ± 2.70e-09 | 1.52e+04 ± 2.81e+03 | 8.28e+03 ± 5.26e+02 | 3.07e+03 ± 1.32e+02 |  2.70x ± 0.20x  |
| 11 | cosine | 1536 | float16  | 2.00e-02 ± 1.76e-02 | 2.02e-05 ± 1.39e-05 | 1.43e+04 ± 2.37e+03 | 2.74e+04 ± 3.46e+03 | 1.38e+03 ± 1.25e+02 | 19.98x ± 2.35x  |
| 12 | cosine | 1536 | float32  | 2.24e-08 ± 1.40e-08 | 3.77e-09 ± 2.84e-09 | 1.36e+04 ± 2.64e+03 | 8.64e+03 ± 8.04e+02 | 1.23e+03 ± 8.10e+01 |  7.06x ± 0.72x  |
| 13 | cosine | 1536 | float64  | 0.00e+00 ± 0.00e+00 | 1.35e-11 ± 1.85e-11 | 1.34e+04 ± 1.27e+03 | 7.31e+03 ± 8.12e+02 | 1.98e+03 ± 2.02e+03 |  4.49x ± 1.01x  |
| 14 | cosine | 1536 |   int8   | 0.00e+00 ± 0.00e+00 | 4.20e-04 ± 4.88e-04 | 9.47e+03 ± 2.09e+03 | 1.01e+04 ± 1.11e+03 | 1.95e+03 ± 1.04e+02 |  5.19x ± 0.56x  |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+

Arm: AWS Graviton 3

Baseline

+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-----------------+
|    | Metric | NDim |  DType   |   Baseline Error    |    SimSIMD Error    |  Accurate Duration  |  Baseline Duration  |  SimSIMD Duration   | SimSIMD Speedup |
+----+--------+------+----------+---------------------+---------------------+---------------------+---------------------+---------------------+-------...

v5.3: 5x Faster Set Intersections with SVE2, NEON, and AVX-512

16 Sep 05:46


  • Need to accelerate TF-IDF search ranking?
  • Joining large tables in an OLAP database?
  • Implementing graph algorithms?

Chances are, you need fast set intersections! It's one of the most common operations in programming, yet one of the hardest to accelerate with SIMD! This PR improves existing kernels and adds new ones for fast set intersections of sorted arrays of unique u16 and u32 values. Now, SimSIMD is not only practically the only production codebase to use Arm SVE, but also one of the first to use the new SVE2 instructions available on AWS Graviton 4 CPUs and coming to Nvidia's Grace Hopper, Microsoft Cobalt, and Google Axion! So upgrade to v5.3 and let's make the databases & search systems go 5x faster!
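
For context, the scalar baseline that these SIMD kernels accelerate is the classic two-pointer merge over sorted arrays of unique values. A minimal sketch, not the SimSIMD implementation itself:

```python
def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    # Two-pointer intersection of sorted arrays of unique values —
    # the serial baseline that the NEON/SVE2/AVX-512 kernels speed up
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:  # a[i] == b[j]: a match, advance both cursors
            out.append(a[i])
            i += 1
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))  # [3, 5]
```

The SIMD versions replace the one-element-at-a-time comparison with wide vector compares, which is why the speedup grows with set size and shrinks when the inputs are very asymmetric.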

Speedups on x86

The new AVX-512 variant shows significant improvements in pairs/s across all benchmarks:

  • For |A|=128, |B|=128, |A∩B|=1, pairs/s increased
    • from 1.14M/s in the old implementation
    • to 7.73M/s in the new one, a 6.7x improvement.
  • At |A∩B|=64, the pairs/s rose:
    • from 1.13M/s in the old implementation
    • to 8.19M/s in the new one, a 7.2x gain.
  • For larger sets, like |A|=1024 and |B|=8192, with |A∩B|=10, pairs/s increased:
    • from 130.18k/s in the old implementation
    • to 194.50k/s in the new one, a 49% gain.

However, in cases like |A|=128, |B|=8192, with |A∩B|=64, pairs/s decreased from 369.7k/s to 222.9k/s. Overall, though, the new implementation outperforms the previous one, and no case is worse than the serial version.

Speedups on Arm

On the Arm architecture, similar performance gains were achieved using the NEON and SVE2 instruction sets:

  • The optimized NEON implementation showed a 3.9x improvement in pairs/s for |A|=128, |B|=128, |A∩B|=1, going from 1.62M/s to 5.12M/s.
  • For |A∩B|=64 in the same configuration, performance improved from 1.60M/s to 5.51M/s, showing a 3.4x gain.
  • The SVE2 implementation also outperformed the previous SVE setup, achieving 5.6M/s versus 1.27M/s for |A|=128, |B|=128, |A∩B|=1, a 4.4x improvement.
  • On larger datasets, such as |A|=1024 and |B|=8192, the pairs/s increased from 49.3k/s to 110.03k/s with NEON, and to 109.47k/s with SVE2, more than doubling the performance.

x86 Benchmarking Setup

The benchmarking was conducted on r7iz AWS instances with Intel Sapphire Rapids CPUs.

Running build_release/simsimd_bench
Run on (16 X 3900.51 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 2048 KiB (x8)
  L3 Unified 61440 KiB (x1)

Old Serial Baselines

-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1             567 ns          567 ns     24785678 pairs=1.76263M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1             567 ns          567 ns     24598141 pairs=1.76286M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1            569 ns          569 ns     24741572 pairs=1.75684M/s
intersect_u16_serial<|A|=128,|B|=128,|A∩B|=121>/min_time:10.000/threads:1           568 ns          568 ns     24871638 pairs=1.76073M/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1           2508 ns         2508 ns      5591748 pairs=398.803k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1           2509 ns         2509 ns      5589871 pairs=398.535k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1          2530 ns         2530 ns      5564535 pairs=395.33k/s
intersect_u16_serial<|A|=128,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1         2522 ns         2522 ns      5532306 pairs=396.447k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1           4791 ns         4791 ns      2920833 pairs=208.737k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1           4800 ns         4800 ns      2923139 pairs=208.346k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1          4821 ns         4820 ns      2906942 pairs=207.448k/s
intersect_u16_serial<|A|=128,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1         4843 ns         4843 ns      2897334 pairs=206.504k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1         4484 ns         4484 ns      3122873 pairs=223.023k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1         4479 ns         4479 ns      3124662 pairs=223.261k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1        4484 ns         4484 ns      3125584 pairs=223.034k/s
intersect_u16_serial<|A|=1024,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1        4500 ns         4500 ns      3104588 pairs=222.229k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1        20118 ns        20117 ns       696244 pairs=49.7084k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1        20134 ns        20134 ns       696160 pairs=49.6682k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1       20125 ns        20124 ns       695799 pairs=49.6911k/s
intersect_u16_serial<|A|=1024,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1       20102 ns        20102 ns       695762 pairs=49.7464k/s

Existing AVX-512 Implementation

-------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=1>/min_time:10.000/threads:1             875 ns          875 ns     16248886 pairs=1.14342M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=6>/min_time:10.000/threads:1             873 ns          873 ns     16081249 pairs=1.14555M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=64>/min_time:10.000/threads:1            882 ns          882 ns     15851609 pairs=1.13354M/s
intersect_u16_ice<|A|=128d,|B|=128,|A∩B|=121>/min_time:10.000/threads:1           916 ns          916 ns     15282595 pairs=1091.32k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=1>/min_time:10.000/threads:1            955 ns          955 ns     14660187 pairs=1047.53k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=6>/min_time:10.000/threads:1            955 ns          955 ns     14663375 pairs=1047.57k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=64>/min_time:10.000/threads:1           952 ns          952 ns     14702462 pairs=1050.17k/s
intersect_u16_ice<|A|=128d,|B|=1024,|A∩B|=121>/min_time:10.000/threads:1          949 ns          949 ns     14743103 pairs=1053.59k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=1>/min_time:10.000/threads:1           2718 ns         2718 ns      5168053 pairs=367.871k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=6>/min_time:10.000/threads:1           2698 ns         2698 ns      5155819 pairs=370.664k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=64>/min_time:10.000/threads:1          2705 ns         2705 ns      5203675 pairs=369.686k/s
intersect_u16_ice<|A|=128d,|B|=8192,|A∩B|=121>/min_time:10.000/threads:1         2693 ns         2693 ns      5187007 pairs=371.377k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=10>/min_time:10.000/threads:1         7310 ns         7310 ns      1910292 pairs=136.8k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=51>/min_time:10.000/threads:1         7312 ns         7312 ns      1913190 pairs=136.759k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=512>/min_time:10.000/threads:1        7365 ns         7365 ns      1900946 pairs=135.781k/s
intersect_u16_ice<|A|=1024d,|B|=1024,|A∩B|=972>/min_time:10.000/threads:1        7439 ns         7439 ns      1882319 pairs=134.43k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=10>/min_time:10.000/threads:1         7682 ns         7681 ns      1821784 pairs=130.183k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=51>/min_time:10.000/threads:1         7695 ns         7695 ns      1821861 pairs=129.955k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=512>/min_time:10.000/threads:1        7643 ns         7643 ns      1829955 pairs=130.842k/s
intersect_u16_ice<|A|=1024d,|B|=8192,|A∩B|=972>/min_time:10.000/threads:1        7617 ns         7617 ns      1838612 pairs=131.279k/s

New AVX-512 Implementation

--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=1>/min_time:10.000/threads:1             129 ns          129 ns    101989513 pairs=7.72559M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=6>/min_time:10.000/threads:1             134 ns          134 ns    107140278 pairs=7.46949M/s
intersect_u16_ice<|A|=128,|B|=128,|A∩B|=64>/min_time:10.000/threads:1            122 ns          122 ns    113134485 pairs=8.18634M/s
inter...