Move away from using field info to pass parameters #1914

jmazanec15 · 2024-07-31T17:06:25Z

Description

Right now, we add several different attributes in the Lucene fieldType in order to pass parameters from the fieldmapper (opensearch extension) to the KNN80DocValuesConsumer (lucene extension). https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNN80Codec/KNN80DocValuesConsumer.java. The problem with this is that attributes have to be key-value string pairs, so additional complexity is required to properly configure (just see KNN80DocValuesConsumer).

I think we could drastically simplify this by leveraging the KNNVectorFieldType (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/mapper/KNNVectorFieldMapper.java#L471 - opensearch concept). KNNVectorFieldType has the KNNMethodContext for the field and so can be used to specify pretty much everything we need in order to build the underlying index.

To get these values to the lucene extension (i.e. KNN80DocValuesConsumer and/or NativeEngines990KnnVectorsWriter), we can leverage the per-field formats that have access to the MapperService (and therefore the KnnVectorFieldType and therefore the KNNMethodContext) - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/codec/KNNCodecService.java#L33. On search, we already have access to the KnnVectorFieldType and KNNMethodContext (https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNQueryBuilder.java#L347).

We would need to ensure in the KNNVectorFieldType we handle cases of model and legacy elegantly.

We could even take it a step further and have the engine return the index creator that the underlying index creation code could use. But maybe one step at a time

Potential benefits:

Simplification of parametrized codec code - we would not need to do branching or parsing of attributes in the index creation logic. Instead, we could properly encapsulate this somewhere else
Minute improvement in performance - because codec is created once per shard, we wouldnt constantly have to configure the index creation logic for each segment

Unknowns:

One challenge may be dynamic mapping parameters that impact indexing - we dont support this now, but may want to in the future. Itd be tricky to do this - some more research would need to be done.

BWC concerns

On write, because the field attributes are per segment (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L26), they should not need to be used by us on upgrade, so no issue.

On read, in search we do lookup attributes based on field (see https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNWeight.java#L222) but this info can actually be held in the KNNVectorFieldType, so I dont think we need it.

jmazanec15 added the Refactoring Improve the design, structure, and implementation while preserving its functionality label Jul 31, 2024

github-actions bot added the untriaged label Jul 31, 2024

jmazanec15 removed the untriaged label Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move away from using field info to pass parameters #1914

Move away from using field info to pass parameters #1914

jmazanec15 commented Jul 31, 2024

Move away from using field info to pass parameters #1914

Move away from using field info to pass parameters #1914

Comments

jmazanec15 commented Jul 31, 2024

Description

BWC concerns