Commit

Merge pull request #109 from ashvardanian/main-dev
Improved Rust and JavaScript tooling
ashvardanian authored Apr 8, 2024
2 parents 127ead1 + 1f1fd3a commit dbfe5cd
Showing 7 changed files with 318 additions and 20 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

11 changes: 7 additions & 4 deletions Cargo.toml
@@ -4,10 +4,7 @@ description = "Fastest SIMD-Accelerated Vector Similarity Functions for x86 and
version = "4.2.2"
edition = "2021"
license = "Apache-2.0"
authors = [
"Ash Vardanian <[email protected]>",
"Pedro Planel <[email protected]>",
]
authors = ["Ash Vardanian <[email protected]>"]
repository = "https://github.com/ashvardanian/SimSIMD"
documentation = "https://docs.rs/simsimd"
homepage = "https://ashvardanian.com/posts/simsimd-faster-scipy"
@@ -38,6 +35,12 @@ name = "sqeuclidean"
harness = false
path = "rust/benches/sqeuclidean.rs"

[profile.bench]
opt-level = 3 # Corresponds to -O3
lto = true # Enables Link Time Optimization for further optimizations
codegen-units = 1 # May increase compilation time but optimizes further
rpath = false # Do not embed a runtime library search path (the Cargo default)

[dev-dependencies]
criterion = { version = "0.5.1" }
rand = { version = "0.8.5" }
35 changes: 28 additions & 7 deletions README.md
@@ -1,12 +1,14 @@
# SimSIMD 📏

_Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval.
![SimSIMD banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/SimSIMD.png?raw=true)

Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval.
These algorithms generally have linear complexity in time, constant complexity in space, and are data-parallel.
In other words, they are easily parallelizable and vectorizable and are often available in packages like BLAS and LAPACK, as well as in the higher-level `numpy` and `scipy` Python libraries.
Ironically, even with decades of evolution in compilers and numerical computing, [most libraries can be 3-200x slower than hardware potential][benchmarks] even on the most popular hardware, like 64-bit x86 and Arm CPUs.
SimSIMD attempts to fill that gap.
1️⃣ SimSIMD functions are practically as fast as `memcpy`.
2️⃣ SimSIMD [compiles to more platforms than NumPy (105 vs 35)][compatibility] and has more backends than most BLAS implementations._
2️⃣ SimSIMD [compiles to more platforms than NumPy (105 vs 35)][compatibility] and has more backends than most BLAS implementations.

[benchmarks]: https://ashvardanian.com/posts/simsimd-faster-scipy
[compatibility]: https://pypi.org/project/simsimd/#files
@@ -400,8 +402,7 @@ To install, choose one of the following options depending on your environment:
- `pnpm add simsimd`
- `bun install simsimd`

The package is distributed with prebuilt binaries for Node.js v10 and above for Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (i386, x86_64).
If your platform is not supported, you can build the package from the source via `npm run build`.
The package is distributed with prebuilt binaries, but if your platform is not supported, you can build the package from source via `npm run build`.
This will automatically happen unless you install the package with the `--ignore-scripts` flag or use Bun.
After you install it, you will be able to call the SimSIMD functions on various `TypedArray` variants:

@@ -415,14 +416,34 @@ const distance = sqeuclidean(vectorA, vectorB);
console.log('Squared Euclidean Distance:', distance);
```

Other numeric types and precision levels are supported as well:
Other numeric types and precision levels are supported as well.
For double-precision floating-point numbers, use `Float64Array`:

```js
const vectorA = new Float64Array([1.0, 2.0, 3.0]);
const vectorB = new Float64Array([4.0, 5.0, 6.0]);

const distance = cosine(vectorA, vectorB);
console.log('Cosine Distance:', distance);
```

When doing machine learning and vector search with high-dimensional vectors, you may want to quantize them to 8-bit integers.
One option is to project values from the $[-1, 1]$ range to the $[-100, 100]$ range and then cast them to `Int8Array`, which, unlike `Uint8Array`, preserves negative values:

```js
const quantizedVectorA = new Int8Array(vectorA.map(v => (v * 100)));
const quantizedVectorB = new Int8Array(vectorB.map(v => (v * 100)));
const distance = cosine(quantizedVectorA, quantizedVectorB);
```
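
The projection step can be sketched in plain JavaScript without calling into the library. The helper name `quantizeToInt8` and its `scale` parameter are hypothetical and not part of the SimSIMD API. The sketch assumes inputs are meant to lie in $[-1, 1]$, clamps anything outside that range, and writes into a signed `Int8Array` because an unsigned array would wrap negative values.

```js
// Hypothetical helper for illustration; not part of the SimSIMD API.
// Maps floats from [-1, 1] to integers in [-scale, scale] and stores them as int8.
function quantizeToInt8(vector, scale = 100) {
    const quantized = new Int8Array(vector.length);
    for (let i = 0; i < vector.length; i++) {
        // Clamp first, so the scaled value always fits in [-scale, scale].
        const clamped = Math.max(-1, Math.min(1, vector[i]));
        // Round to the nearest integer instead of relying on implicit truncation.
        quantized[i] = Math.round(clamped * scale);
    }
    return quantized;
}

const unitRangeVector = new Float32Array([0.25, -0.5, 1.0, -1.0]);
const quantizedVector = quantizeToInt8(unitRangeVector);
console.log(Array.from(quantizedVector)); // values: 25, -50, 100, -100
```

A scale of 100 leaves headroom below the `Int8Array` limits of -128 and 127, so values slightly outside the expected range would still fit even without the clamp.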

A more extreme quantization case would be to use binary vectors.
You can map all positive values to `1` and all negative values and zero to `0`, packing eight values into a single byte.
After that, Hamming and Jaccard distances can be computed.

```js
const { toBinary, hamming } = require('simsimd');

const binaryVectorA = toBinary(vectorA);
const binaryVectorB = toBinary(vectorB);
const distance = hamming(binaryVectorA, binaryVectorB);
```
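
To make the packed layout concrete, here is a minimal pure-JavaScript sketch of the same idea. `packBits` mirrors the most-significant-bit-first packing that `toBinary` uses in this commit, and `hammingReference` is a hypothetical reference that counts differing bits. Both names are illustrative, and the native `hamming` is assumed, not confirmed here, to compute the same quantity.

```js
// Hypothetical reference code for illustration; not part of the SimSIMD API.

// Packs one bit per element (1 if the value is positive, else 0),
// eight elements per byte, most significant bit first.
function packBits(vector) {
    const packed = new Uint8Array(Math.ceil(vector.length / 8));
    for (let i = 0; i < vector.length; i++) {
        if (vector[i] > 0) packed[i >> 3] |= 1 << (7 - (i % 8));
    }
    return packed;
}

// Counts the bits that differ between two equally sized packed vectors.
function hammingReference(a, b) {
    if (a.length !== b.length) throw new RangeError('Vectors must have equal length');
    let distance = 0;
    for (let i = 0; i < a.length; i++) {
        let diff = a[i] ^ b[i];
        // Clear the lowest set bit until none remain (Kernighan's bit count).
        while (diff) {
            diff &= diff - 1;
            distance++;
        }
    }
    return distance;
}

const bitsA = packBits(new Float32Array([0.5, -0.2, 0.0, 1.5, -3.0, 2.0, 0.1, -0.1, 4.0]));
const bitsB = packBits(new Float32Array([-1, -1, 3, 1, -1, 2, -2, 5, -4]));
console.log(hammingReference(bitsA, bitsB)); // 5
```

Nine input values occupy two bytes here, and the unused trailing bits of the last byte stay zero in both vectors, so they never contribute to the distance.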

## Using SimSIMD in C
28 changes: 26 additions & 2 deletions javascript/simsimd.ts
@@ -102,6 +102,27 @@ export const jensenshannon = (a: Float64Array | Float32Array, b: Float64Array |
return compiled.jensenshannon(a, b);
};

/**
* Quantizes a floating-point vector into a binary vector (1 for positive values, 0 for non-positive values) and packs the result into a Uint8Array, where each element represents 8 binary values from the original vector.
* This function is useful for preparing data for bitwise distance or similarity computations, such as Hamming or Jaccard indices.
*
* @param {Float32Array | Float64Array | Int8Array} vector The floating-point vector to be quantized and packed.
* @returns {Uint8Array} A Uint8Array where each byte represents 8 binary quantized values from the input vector.
*/
export const toBinary = (vector: Float32Array | Float64Array | Int8Array): Uint8Array => {
const byteLength = Math.ceil(vector.length / 8);
const packedVector = new Uint8Array(byteLength);

for (let i = 0; i < vector.length; i++) {
if (vector[i] > 0) {
const byteIndex = Math.floor(i / 8);
const bitPosition = 7 - (i % 8);
packedVector[byteIndex] |= (1 << bitPosition);
}
}

return packedVector;
};
export default {
dot,
inner,
@@ -111,10 +132,13 @@ export default {
jaccard,
kullbackleibler,
jensenshannon,
toBinary,
};

// utility functions to help find native builds

/**
* @brief Finds the directory where the native build of the simsimd module is located.
* @param {string} dir - The directory to start the search from.
*/
function getBuildDir(dir: string) {
if (existsSync(path.join(dir, "build"))) return dir;
if (existsSync(path.join(dir, "prebuilds"))) return dir;
10 changes: 8 additions & 2 deletions rust/benches/cosine.rs
@@ -17,8 +17,14 @@ pub fn cos_benchmark(c: &mut Criterion) {
group.bench_with_input(BenchmarkId::new("SimSIMD", i), &i, |b, _| {
b.iter(|| SimSIMD::cosine(&inputs.0, &inputs.1))
});
group.bench_with_input(BenchmarkId::new("Rust Native", i), &i, |b, _| {
b.iter(|| native::cos_similarity_cpu(&inputs.0, &inputs.1))
group.bench_with_input(BenchmarkId::new("Rust Procedural", i), &i, |b, _| {
b.iter(|| native::baseline_cos_procedural(&inputs.0, &inputs.1))
});
group.bench_with_input(BenchmarkId::new("Rust Functional", i), &i, |b, _| {
b.iter(|| native::baseline_cos_functional(&inputs.0, &inputs.1))
});
group.bench_with_input(BenchmarkId::new("Rust Unrolled", i), &i, |b, _| {
b.iter(|| native::baseline_cos_unrolled(&inputs.0, &inputs.1))
});
}
}