Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

VijayanB · 2024-10-04T20:03:45Z

Description

Added new updatable index setting "approximate_threshold", which will be
considered when to build graph or not for native engines. This is noop for lucene.
When graph is not available, plugin will return empty results. With this change, exact search will be performed when only no engine file is available in segment.
Radial search was never included as part of exact search. This will break radial search when threshold
is added and radial search is requested. Hence, introduce exact search during radial search similar to Approx search.

Related Issues

#1942

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

… creation (opensearch-project#2007) Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build braph or not for native engines. This is noop for lucene. This depends on use lucene format as prerequisite. We don't need to add flag since it is only enable if lucene format is already enabled. Signed-off-by: Vijayan Balasubramanian <[email protected]>

Signed-off-by: Vijayan Balasubramanian <[email protected]>

…ject#2175) Previosuly we only added support to build greedily for non quantization scenario. In this commit, we can remove that constraint, however, we cannot skip writing quanitization state since it is required irrespective of type of search is executed later. Signed-off-by: Vijayan Balasubramanian <[email protected]>

…project#2136) * Add exact search if no engine files are in segments When graph is not available, plugin will return empty results. With this change, exact search will be performed when only no engine file is available in segment. We also don't need version check or feature flag because, option to not build vector data structure will only be available post 2.17. If an index is created using pre 2.17 version, segment will always have engine files and this feature will never be called during search. --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>

* Add support for radial search in exact search When threshold value is set, knn plugin will not be creating graph. Hence, when search request is trigged during that time, exact search will return valid results. However, radial search was never included as part of exact search. This will break radial search when threshold is added and radial search is requested. In this commit, new method is introduced to accept min score and return documents that are greater than min score, similar to how radial search is performed by native engines. This search is independent of engine, but, radial search is supported only for FAISS engine out of all native engines. Signed-off-by: Vijayan Balasubramanian <[email protected]> --------- Signed-off-by: Vijayan Balasubramanian <[email protected]>

Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB · 2024-10-09T02:44:51Z

In this comment we will be looking at some of the options on how to expose a setting “approximate_threshold” which controls when to build vector data structure during flush and merge operation. If no value is provided during index creation, we will use default value to decide whether to build vector data structure or not during flush and merge operations.

The possible values for this parameters are:

-1 => When users want to disable building vector data structure during indexing
0 => When users want to build vector data structure always ( this is the behavior till 2.17)
1.. INTEGER.MAX_VALUE - 2 => Will build vector data structure only when a segment contains more than this number of documents.

Proposal

1. Keep “approximate_threshold” as dynamic index setting

In this approach, users can set this value per index. They can use “Update Index Setting” API to update this value on live index at any time.

Example:

PUT my-test-knn-index/_settings
{
      "index.knn.advanced.approximate_threshold":-1
}

Pros:

It is easy to update setting by just passing only the property and the desired value.
This setting will be shared by all vector fields from same index. Users don’t have to update individual fields explicitly.
This is aligned with other property “index.knn.advanced.filtered_exact_search_threshold"

Cons

Different vector fields from same index cannot have different value.

2. Keep “approximate_threshold” as mapping parameter

In this approach, users can set this value as mapping parameter for knn vector field type. They can use “Update Mapping Setting” API to update this value.

Example:

PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "space_type": "l2",
      "dimension": 2,
`      "approximate_threshold": -1,`
      "method": {
        "name": "hnsw",
        "engine": "nmslib",
        "parameters": {
          "ef_construction": 128,
          "m": 24
        }
      }
    }
  }
}

Pros

Different vector fields from same index can have different value. This gives more control to the users since they can tune this parameter individually at field level.
This works well if user is going to set it and forget during index creation step.

Cons

This doesn’t provide very good user experience if users want to update the value after index creation step. we would like to provide same experience as index setting, for example.

PUT my-test-knn-index/_mappings
{
  "properties": {
    "my_vector1": {
      "type": "knn_vector",
      "approximate_threshold": -1
    }
  }
}

But in order to update a mapping parameter, users has to pass other non-updatable parameters as part of update mapping api request. In above example, to update “approximate_threshold", we had to first use GET Setting API to get the non-updatable properties like type, space_type, dimension, methods etc and then use exact value in update mapping api to change ”approximate_threshold".

3. Combine 1 & 2

In this approach we will provide this as both index setting and mapping parameter, where index setting will be considered as global setting for every field, and, users can override this value for any specific field by updating as mapping parameter.

Pros

It is easy to update index setting by just passing only the property and the desired value.
The index setting will be shared by all vector fields from same index. Users don’t have to update individual fields explicitly.
This is aligned partially with other property “index.knn.advanced.filtered_exact_search_threshold"
Different vector fields from same index can have different value by overriding at field level. This gives more control to the users since they can tune this parameter individually at field level.

Cons

This doesn’t solve the problem of having to pass non-updatable field as part of update mapping api request. However, this will be only considered in a specific case where users had to override this value for certain fields. This significantly reduces the nuances for most of the use case where user either has an index with one vector field or they want to use same setting for every vector field.

Finalized approach

Since there is no work around to use as mapping parameter and provide better user experience at same time, we can first introduce this as “index setting”, (Proposal 1 ) and, if we receive substantial feedback on having it as mapping parameter from community, we can introduce mapping parameter in next iteration**(effectively Proposal 3).**

VijayanB · 2024-10-09T02:50:50Z

Link to Documentation issue opensearch-project/documentation-website#8482

jmazanec15

Overall, I think this looks pretty good

src/main/java/org/opensearch/knn/index/codec/BasePerFieldKnnVectorsFormat.java

src/test/java/org/opensearch/knn/index/OpenSearchIT.java

Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB · 2024-10-09T19:13:09Z

Test failures are not due to this PR.

VijayanB requested review from heemin32, navneet1v, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, ryanbogan and luyuncheng as code owners October 4, 2024 20:03

VijayanB self-assigned this Oct 4, 2024

VijayanB added Roadmap:Vector Database/GenAI Project-wide roadmap label v2.18.0 and removed Roadmap:Vector Database/GenAI Project-wide roadmap label labels Oct 4, 2024

VijayanB added 8 commits October 8, 2024 15:11

Add integration test for binary vector values (opensearch-project#2142)

ea66747

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Update CHANGELOG

e479b73

Signed-off-by: Vijayan Balasubramanian <[email protected]>

update setting name

1fe9284

Signed-off-by: Vijayan Balasubramanian <[email protected]>

Fix spotless

0893e1f

Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB force-pushed the merge-to-main branch 2 times, most recently from b124156 to 76f8a41 Compare October 8, 2024 23:44

jmazanec15 reviewed Oct 9, 2024

View reviewed changes

Rename variables

2501ad2

Signed-off-by: Vijayan Balasubramanian <[email protected]>

VijayanB force-pushed the merge-to-main branch from 76f8a41 to 2501ad2 Compare October 9, 2024 16:18

VijayanB requested a review from jmazanec15 October 9, 2024 16:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

VijayanB commented Oct 4, 2024 •

edited

Loading

VijayanB commented Oct 9, 2024

VijayanB commented Oct 9, 2024

jmazanec15 left a comment

VijayanB commented Oct 9, 2024

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

Are you sure you want to change the base?

Add support to build vector data structures greedily and perform exact search when there are no engine files #2188

Conversation

VijayanB commented Oct 4, 2024 • edited Loading

Description

Related Issues

Check List

VijayanB commented Oct 9, 2024

Proposal

1. Keep “approximate_threshold” as dynamic index setting

Example:

Pros:

Cons

2. Keep “approximate_threshold” as mapping parameter

Example:

Pros

Cons

3. Combine 1 & 2

Pros

Cons

Finalized approach

VijayanB commented Oct 9, 2024

jmazanec15 left a comment

Choose a reason for hiding this comment

VijayanB commented Oct 9, 2024

VijayanB commented Oct 4, 2024 •

edited

Loading