Update k-NN documentation for faiss support feature #280

Merged: 7 commits, Nov 23, 2021
Changes from 6 commits
324 changes: 273 additions & 51 deletions _search-plugins/knn/api.md

Large diffs are not rendered by default.

176 changes: 161 additions & 15 deletions _search-plugins/knn/approximate-knn.md
@@ -9,19 +9,32 @@ has_math: true

# Approximate k-NN search

The approximate k-NN method uses [nmslib's](https://github.com/nmslib/nmslib/) implementation of the Hierarchical Navigable Small World (HNSW) algorithm to power k-NN search. In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the three methods, this method offers the best search scalability for large data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is preferred.
The approximate k-NN search method uses nearest neighbor algorithms from *nmslib* and *faiss* to power
k-NN search. To see the algorithms that the plugin currently supports, check out the [k-NN Index documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions).
In this case, approximate means that for a given search, the neighbors returned are an estimate of the true k-nearest
neighbors. Of the three search methods the plugin provides, this method offers the best search scalability for large
data sets. Generally speaking, once the data set gets into the hundreds of thousands of vectors, this approach is
preferred.
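
One common way to quantify how close an approximate result is to the exact one is recall at k: the fraction of the true k-nearest neighbors that the approximate search returned. The sketch below is illustrative only. The helper names (`exact_knn`, `recall_at_k`) and the sample data are assumptions made for this example and are not part of the plugin.

```python
import math


def exact_knn(query, vectors, k):
    """Brute-force k-NN: return the ids of the k vectors closest to query (L2)."""
    ranked = sorted(vectors, key=lambda doc_id: math.dist(query, vectors[doc_id]))
    return ranked[:k]


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true nearest neighbors that the approximate search found."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


# Hypothetical data: four 2-dimensional vectors keyed by document id.
vectors = {"1": [0.0, 0.0], "2": [1.0, 0.0], "3": [0.0, 2.0], "4": [5.0, 5.0]}
truth = exact_knn([0.0, 0.0], vectors, k=2)

# Pretend an approximate search returned ids "1" and "3" instead of the true top 2.
approx = ["1", "3"]
print(truth, recall_at_k(truth, approx))
```

A recall of 1.0 means the approximate search returned exactly the true neighbors; lower values indicate how much accuracy was traded away.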

The k-NN plugin builds an HNSW graph of the vectors for each "knn-vector field"/ "Lucene segment" pair during indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description). These graphs are loaded into native memory during search and managed by a cache. To learn more about pre-loading graphs into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation). Additionally, you can see what graphs are already loaded in memory, which you can learn more about in the [stats API section]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).
The k-NN plugin builds a native library index of the vectors for each "knn-vector field"/ "Lucene segment" pair during
indexing that can be used to efficiently find the k-nearest neighbors to a query vector during search. To learn more about
Lucene segments, see the [Apache Lucene documentation](https://lucene.apache.org/core/{{site.lucene_version}}/core/org/apache/lucene/codecs/lucene87/package-summary.html#package.description).
These native library indices are loaded into native memory during search and managed by a cache. To learn more about
pre-loading native library indices into memory, refer to the [warmup API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#warmup-operation).
To see which native library indices are already loaded in memory, use the
[stats API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#stats).

Because the graphs are constructed during indexing, it is not possible to apply a filter on an index and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor search.
Because the native library indices are constructed during indexing, it is not possible to apply a filter on an index
and then use this search method. All filters are applied on the results produced by the approximate nearest neighbor
search.
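
The effect of filtering after the nearest neighbor search can be sketched in a few lines. This is a simplified model, not the plugin's implementation: a brute-force `top_k` stands in for the approximate search, and the documents and the `price` field are hypothetical.

```python
import math


def top_k(query, docs, k):
    """Return the k documents nearest to query (brute force stands in for ANN search)."""
    return sorted(docs, key=lambda d: math.dist(query, d["vector"]))[:k]


def search_with_post_filter(query, docs, k, predicate):
    """Run the k-NN search first, then drop the hits that fail the filter."""
    return [d for d in top_k(query, docs, k) if predicate(d)]


# Hypothetical documents; the "price" field exists only for illustration.
docs = [
    {"id": "1", "vector": [1.0, 1.0], "price": 12},
    {"id": "2", "vector": [1.5, 1.0], "price": 3},
    {"id": "3", "vector": [9.0, 9.0], "price": 4},
]

hits = search_with_post_filter(
    [1.0, 1.0], docs, k=2, predicate=lambda d: d["price"] < 10
)
print([d["id"] for d in hits])
```

Two documents satisfy the filter, but only one is returned, because document 3 was never among the `k` nearest neighbors that the filter was applied to.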

## Get started with approximate k-NN

To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with setting `index.knn` to `true`. This setting tells the plugin to create HNSW graphs for the index.
To use the k-NN plugin's approximate search functionality, you must first create a k-NN index with the `index.knn`
setting set to `true`. This setting tells the plugin to create native library indices for the index.

Additionally, if you're using the approximate k-nearest neighbor method, set `index.knn.space_type` to the space you're interested in. You can't change this setting after it's set. To see what spaces we support, see [spaces](#spaces). By default, `index.knn.space_type` is `l2`. For more information about index settings, such as algorithm parameters you can tweak to tune performance, see [Index settings]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#index-settings).

Next, you must add one or more fields of the `knn_vector` data type. This example creates an index with two `knn_vector` fields and uses cosine similarity:
Next, you must add one or more fields of the `knn_vector` data type. This example creates an index with two
`knn_vector` fields, one using *faiss* and the other using *nmslib*:

```json
PUT my-knn-index-1
@@ -52,8 +65,8 @@ PUT my-knn-index-1
"dimension": 4,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"space_type": "innerproduct",
"engine": "faiss",
"parameters": {
"ef_construction": 256,
"m": 48
@@ -65,9 +78,14 @@ PUT my-knn-index-1
}
```

The `knn_vector` data type supports a vector of floats that can have a dimension of up to 10,000, as set by the dimension mapping parameter.
In the example above, both `knn_vector` fields are configured from method definitions. A `knn_vector` field can also
be configured from a model. To learn more, see the [knn_vector data type documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#knn_vector-data-type).

The `knn_vector` data type supports a vector of floats that can have a dimension of up to 10,000, as set by the
dimension mapping parameter.

In OpenSearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector data to graphs so that the underlying k-NN search library can read it.
In OpenSearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector
data to native library indices so that the underlying k-NN search library can read it.
{: .tip }

After you create the index, you can add some data to it:
@@ -112,10 +130,131 @@ GET my-knn-index-1/_search
```json
}
```

`k` is the number of neighbors the search of each graph will return. You must also include the `size` option, which indicates how many results the query actually returns. The plugin returns `k` amount of results for each shard (and each segment) and `size` amount of results for the entire query. The plugin supports a maximum `k` value of 10,000.
`k` is the number of neighbors that the search of each graph returns. You must also include the `size` option, which
indicates how many results the query actually returns. The plugin returns `k` results for each shard
(and each segment) and `size` results for the entire query. The plugin supports a maximum `k` value of 10,000.
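
The relationship between `k` and `size` can be sketched as follows. This is a simplified model that ignores shards and treats each segment as a plain list of scored documents. The function names and sample scores are assumptions for illustration, not the plugin's code.

```python
import heapq


def search_segment(segment, k):
    """Return the k best (score, doc_id) pairs from one segment."""
    return heapq.nlargest(k, segment)


def search_index(segments, k, size):
    """Collect k candidates per segment, then keep the best `size` overall."""
    candidates = []
    for segment in segments:
        candidates.extend(search_segment(segment, k))
    return heapq.nlargest(size, candidates)


# Hypothetical (score, doc_id) pairs for two segments; a higher score is better.
segments = [
    [(0.9, "a"), (0.5, "b"), (0.1, "c")],
    [(0.8, "d"), (0.7, "e"), (0.2, "f")],
]

print(search_index(segments, k=2, size=3))
```

With `k=2` and two segments, at most four candidates exist, so asking for a `size` larger than four would still return only four results in this model.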

### Building a k-NN index from a model

For some of the algorithms that we support, the native library index needs to be trained before it can be used. Training
every time a segment is created would be very expensive, so, instead, we introduce the concept of a *model* that is used
to initialize the native library index during segment creation. A *model* is created by calling the [Train API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api#train-model),
passing in the source of training data as well as the method definition of the model. Once training is complete, the
model is serialized to a k-NN model system index. Then, during indexing, the model is pulled from this index to
initialize the segments.

In order to train a model, we first need an OpenSearch index with training data in it. Training data can come from
any `knn_vector` field that has a dimension matching the dimension of the model you want to create. Training data can be
the same data that you are going to index or a separate set. Let's create a training index:

```json
PUT /train-index
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 0
},
"mappings": {
"properties": {
"train-field": {
"type": "knn_vector",
"dimension": 4
}
}
}
}
```

Notice that `index.knn` is not set in the index settings. This ensures that we do not create native library indices for
this index.

Next, let's add some data to it:
```json
POST _bulk
{ "index": { "_index": "train-index", "_id": "1" } }
{ "train-field": [1.5, 5.5, 4.5, 6.4]}
{ "index": { "_index": "train-index", "_id": "2" } }
{ "train-field": [2.5, 3.5, 5.6, 6.7]}
{ "index": { "_index": "train-index", "_id": "3" } }
{ "train-field": [4.5, 5.5, 6.7, 3.7]}
{ "index": { "_index": "train-index", "_id": "4" } }
{ "train-field": [1.5, 5.5, 4.5, 6.4]}
...
```

After indexing into the training index completes, we can call the Train API:
```json
POST /_plugins/_knn/models/_train/my-model
{
"training_index": "train-index",
"training_field": "train-field",
"dimension": 4,
"description": "My model's description",
"search_size": 500,
"method": {
"name":"hnsw",
"engine":"faiss",
"parameters":{
"encoder":{
"name":"pq",
"parameters":{
"code_size": 8,
"m": 8
}
}
}
}
}
```

The Train API will return as soon as the training job is started. To check its status, we can use the Get Model API:
```json
GET /_plugins/_knn/models/my-model?filter_path=state&pretty
{
"state": "training"
}
```
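
Because the Train API returns before training finishes, a client typically polls the model state. The helper below is a minimal sketch of such a loop. `wait_for_model` and its `get_state` argument are hypothetical: `get_state` is any caller-supplied function that returns the current state string, for example by calling the Get Model API shown above. Here a fake iterator stands in for the API so that the example runs without a cluster.

```python
import time


def wait_for_model(get_state, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll get_state() until the model leaves the "training" state.

    get_state is a caller-supplied (hypothetical) function that returns the
    model's current state string. The final state is returned to the caller,
    who should check that it is "created" before using the model.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state != "training":
            return state
        if waited >= timeout:
            raise TimeoutError("model is still training")
        sleep(poll_interval)
        waited += poll_interval


# Stand-in for real API calls: report "training" twice, then "created".
fake_states = iter(["training", "training", "created"])
final_state = wait_for_model(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

Passing `sleep` as a parameter keeps the loop testable; in real use the default `time.sleep` applies.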

Once the model enters the "created" state, we can create an index that will use this model to initialize its native
library indices:
```json
PUT /target-index
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1,
"index.knn": true
},
"mappings": {
"properties": {
"target-field": {
"type": "knn_vector",
"model_id": "my-model"
}
}
}
}
```

Lastly, we can add the documents we want to be searched to the index:
```json
POST _bulk
{ "index": { "_index": "target-index", "_id": "1" } }
{ "target-field": [1.5, 5.5, 4.5, 6.4]}
{ "index": { "_index": "target-index", "_id": "2" } }
{ "target-field": [2.5, 3.5, 5.6, 6.7]}
{ "index": { "_index": "target-index", "_id": "3" } }
{ "target-field": [4.5, 5.5, 6.7, 3.7]}
{ "index": { "_index": "target-index", "_id": "4" } }
{ "target-field": [1.5, 5.5, 4.5, 6.4]}
...
```

After data is ingested, it can be searched just like any other `knn_vector` field!

### Using approximate k-NN with filters
If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:
If you use the `knn` query alongside filters or other clauses (e.g. `bool`, `must`, `match`), you might receive fewer
than `k` results. In this example, `post_filter` reduces the number of results from 2 to 1:

```json
GET my-knn-index-1/_search
```
@@ -142,7 +281,12 @@ GET my-knn-index-1/_search

## Spaces

A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores, we take 1 / (1 + distance). Currently, the k-NN plugin supports the following spaces:
A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest
neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how
OpenSearch scores results, where a greater score equates to a better result. To convert distances to OpenSearch scores,
we take 1 / (1 + distance). The spaces that the k-NN plugin supports are listed below. Not every method supports each of
these spaces. Be sure to check [the method documentation]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index#method-definitions) to make sure the space you are
interested in is supported.

<table>
<thead style="text-align: left">
@@ -181,5 +325,7 @@ A space corresponds to the function used to measure the distance between two poi
</tr>
</table>
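
The distance-to-score conversion described in this section, 1 / (1 + distance), can be sketched as follows. The function name is an assumption for illustration. The point is only that a smaller distance maps to a larger score.

```python
def distance_to_score(distance):
    """Map a distance (lower is better) to a score (higher is better)."""
    return 1.0 / (1.0 + distance)


for distance in [0.0, 1.0, 3.0]:
    print(distance, distance_to_score(distance))
```

A distance of zero yields the maximum score of 1.0, and the score decreases toward zero as the distance grows.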

The cosine similarity formula does not include the `1 -` prefix. However, because nmslib equates smaller scores with closer results, they return `1 - cosineSimilarity` for their cosine similarity space---that's why `1 -` is included in the distance function.
The cosine similarity formula does not include the `1 -` prefix. However, because similarity search libraries equate
smaller scores with closer results, they return `1 - cosineSimilarity` for the cosine similarity space---that's why `1 -` is
included in the distance function.
{: .note }
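
As a small illustration of the note above, the sketch below computes cosine similarity and the corresponding `1 - cosineSimilarity` distance in plain Python. The function names are assumptions for this example, and the code assumes non-zero input vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Distance form used so that smaller values mean closer vectors."""
    return 1.0 - cosine_similarity(x, y)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Vectors pointing in the same direction have the smallest distance, and vectors pointing in opposite directions have the largest.
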
2 changes: 1 addition & 1 deletion _search-plugins/knn/index.md
@@ -18,7 +18,7 @@ This plugin supports three different methods for obtaining the k-nearest neighbo

1. **Approximate k-NN**

The first method takes an approximate nearest neighbor approach---it uses the HNSW algorithm to return the approximate k-nearest neighbors to a query vector. This algorithm sacrifices indexing speed and search accuracy in return for lower latency and more scalable search. To learn more about the algorithm, please refer to [nmslib's documentation](https://github.com/nmslib/nmslib/) or [the paper introducing the algorithm](https://arxiv.org/abs/1603.09320).
The first method takes an approximate nearest neighbor approach---it uses one of several different algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints and more scalable search. To learn more about the algorithms, please refer to [*nmslib*](https://github.com/nmslib/nmslib/blob/master/manual/README.md)'s and [*faiss*](https://github.com/facebookresearch/faiss/wiki)'s documentation.

Approximate k-NN is the best choice for searches over large indices (i.e. hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the script scoring method or painless extensions.

17 changes: 17 additions & 0 deletions _search-plugins/knn/jni-libraries.md
@@ -0,0 +1,17 @@
---
layout: default
title: JNI libraries
nav_order: 6
parent: k-NN
has_children: false
---

# JNI libraries

To integrate [*nmslib*'s](https://github.com/nmslib/nmslib/) and [*faiss*'s](https://github.com/facebookresearch/faiss/) Approximate k-NN functionality (implemented in C++) into the k-NN plugin (implemented in Java), we created a Java Native Interface, which lets the k-NN plugin make calls to the native libraries. To implement this, we create 3 libraries: `libopensearchknn_nmslib`, the JNI library that interfaces with nmslib, `libopensearchknn_faiss`, the JNI library that interfaces with faiss, and `libopensearchknn_common`, a library containing common shared functionality between native libraries.

The libraries `libopensearchknn_faiss` and `libopensearchknn_nmslib` are lazily loaded when they are first called in the plugin. This means that if you are only planning on using one of the libraries, the other one will never be loaded.

For building the libraries from source, please refer to the [DEVELOPER_GUIDE](https://github.com/opensearch-project/k-NN/blob/main/DEVELOPER_GUIDE.md).

For more information about JNI, see [Java Native Interface](https://en.wikipedia.org/wiki/Java_Native_Interface) on Wikipedia.
13 changes: 0 additions & 13 deletions _search-plugins/knn/jni-library.md

This file was deleted.
