Move away from file watcher for releasing memory #1885

Open
jmazanec15 opened this issue Jul 25, 2024 · 25 comments
Labels: Enhancements, Roadmap:Vector Database/GenAI, v2.18.0

Comments

@jmazanec15
Member

Description

Right now, we free memory for segments whose files get deleted by having a resource watcher watch for those files:

  1. Init of resource watcher - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/plugin/KNNPlugin.java#L196
  2. File change watcher initialization - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/memory/NativeMemoryLoadStrategy.java#L84
  3. File watcher - https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/memory/NativeMemoryLoadStrategy.java#L97

This file watching can be removed entirely, which would simplify the code (less static initialization, decoupling of the cache manager from the native memory load strategy, moving away from tight coupling to FSDirectory, etc.). Additionally, the current approach is buggy - see #1012.

To fix this, we just need to ensure that when close is called on either our DocValuesProducer or our KnnVectorsReader, we evict any indices from memory (similar to how we do it now with the file watcher). We would need to implement a really lightweight DocValuesProducer that delegates everything.
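
For illustration, here is a minimal sketch of such a delegating producer. It assumes Lucene 9.x's DocValuesProducer API and the plugin's existing NativeMemoryCacheManager.invalidate; the class name and the cacheKeys plumbing are hypothetical:

import java.io.IOException;
import java.util.List;
import org.apache.lucene.codecs.DocValuesProducer;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedNumericDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.opensearch.knn.index.memory.NativeMemoryCacheManager;

final class EvictingDocValuesProducer extends DocValuesProducer {
    private final DocValuesProducer delegate;
    private final List<String> cacheKeys; // one key per native vector index file in this segment

    EvictingDocValuesProducer(DocValuesProducer delegate, List<String> cacheKeys) {
        this.delegate = delegate;
        this.cacheKeys = cacheKeys;
    }

    // Every read path simply delegates.
    @Override
    public NumericDocValues getNumeric(FieldInfo field) throws IOException {
        return delegate.getNumeric(field);
    }

    @Override
    public BinaryDocValues getBinary(FieldInfo field) throws IOException {
        return delegate.getBinary(field);
    }

    @Override
    public SortedDocValues getSorted(FieldInfo field) throws IOException {
        return delegate.getSorted(field);
    }

    @Override
    public SortedNumericDocValues getSortedNumeric(FieldInfo field) throws IOException {
        return delegate.getSortedNumeric(field);
    }

    @Override
    public SortedSetDocValues getSortedSet(FieldInfo field) throws IOException {
        return delegate.getSortedSet(field);
    }

    @Override
    public void checkIntegrity() throws IOException {
        delegate.checkIntegrity();
    }

    // The one piece of real logic: evict this segment's native allocations on close.
    @Override
    public void close() throws IOException {
        for (String key : cacheKeys) {
            NativeMemoryCacheManager.getInstance().invalidate(key);
        }
        delegate.close();
    }
}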

Related issues

#1012

@luyuncheng
Collaborator

luyuncheng commented Jul 26, 2024

Hi @jmazanec15, I am writing a DocValuesProducer for the KNN80Codec to reduce storage usage. I like this issue very much; it can release memory efficiently. I can contribute my DocValuesProducer if you like.

@heemin32
Collaborator

Is the lifecycle of native memory (the native k-NN index) the same as the lifecycle of the DocValuesProducer or KnnVectorsReader?
I thought the native memory stays allocated until either we clear the cache or the index becomes unusable (deleted or closed).

Is close called on either the DocValuesProducer or KnnVectorsReader when the index is closed or deleted?

@jmazanec15
Member Author

Hi @jmazanec15, I am writing a DocValuesProducer for the KNN80Codec to reduce storage usage. I like this issue very much; it can release memory efficiently. I can contribute my DocValuesProducer if you like.

Awesome! What custom functionality did you add in the DVProducer to reduce storage? I think the change for this issue would be quite small, so I do not see it conflicting.

Is the lifecycle of native memory (the native k-NN index) the same as the lifecycle of the DocValuesProducer or KnnVectorsReader?

DocValuesProducer and KnnVectorsReader are both per segment. The two producer/reader classes are opened (or shared) in the SegmentReader constructor: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java

On close, the close methods will eventually be called:

  1. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java#L220-L231
  2. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentCoreReaders.java#L172-L179

This reader will be required in the search path: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/query/KNNWeight.java#L206

@luyuncheng
Collaborator

Awesome! What custom functionality did you add in the DVProducer to reduce storage? I think the change for this issue would be quite small, so I do not see it conflicting.

@jmazanec15

In #1571 we can save storage for the _source stored field.

And I think we can save docValues storage similarly; see #1571 (comment).

We create two types of docValues in the Lucene engine. We can also get docValues from the native engines: in Faiss, we can use the reconstruct_n function to get the i-th vector values from flat storage, though that keeps the index in memory after it is opened.

@heemin32
Collaborator

@jmazanec15 Right. However, are the producer/reader classes singletons that stay open until the index is closed or deleted?
Otherwise, the close method might get called even when we don't want the native memory cache to be evicted. Or do we want to reload the cache when new producer/reader classes are created?

@jmazanec15
Member Author

@luyuncheng that's interesting. I think @navneet1v may have been thinking along those lines. We are in the process of migrating away from custom doc values to KNNVectorFormat; see #1855. Would you be able to do it with the KNNVectorReader instead of doc values?

@jmazanec15
Member Author

@heemin32 Oh, I think I see what you are saying. Will there be multiple readers/producers per segment/search? I think the purpose of SegmentCoreReaders is to ensure that they can be shared.

Worth noting that for Lucene, the input is closed as well (see https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L325).

@heemin32
Collaborator

@heemin32 Oh, I think I see what you are saying. Will there be multiple readers/producers per segment/search? I think the purpose of SegmentCoreReaders is to ensure that they can be shared.

Sharing is good, but the other question is whether the reader gets closed and opened repeatedly regardless of the existence of the segment file.

@jmazanec15
Member Author

closed and opened repeatedly regardless of the existence of the segment file

I'm not sure why this would happen. We need to be sure, though, before making the change.

@heemin32
Collaborator

I don't know how it works internally, but based on this comment it seems like multiple instances of SegmentReader can be created: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/SegmentReader.java#L128

/**
   * Create new SegmentReader sharing core from a previous SegmentReader and using the provided
   * liveDocs, and recording whether those liveDocs were carried in ram (isNRT=true).
   */

@navneet1v
Collaborator

@jmazanec15 thanks for opening up the issue; I see some great ideas on the GH issue already. Here are my thoughts:

  1. @luyuncheng I would really like to see how we can use the vectors present in the Faiss index rather than having a separate file in the segment for flat vector storage, and also what the latency of mapping those vectors back to heap would be for, say, exact search. As @jmazanec15 already mentioned, we are in the process of moving to KNNVectorFormat by 2.17 (RFC: [RFC] Integrating KNNVectorsFormat in Native Vector Search Engine #1853), so please have a look at that reader code. The skeleton was added by this PR: Reuse KNNVectorFieldData for reduce disk usage #1571; I will be adding more in the coming weeks.
  2. @heemin32 on the question of whether there will be multiple instances of the DVP: I think the answer is no. Lucene maintains a readerPool at its end to ensure that it hands out the same reader to whoever asks for it; check the code here, here. But if that turns out not to be true, we can handle the case easily via refCount. We can do a small experiment to validate, since the Lucene code gives mixed signals.
  3. Now, on opening and closing a DVP multiple times: the answer is again no. A DVP gets opened when a segment is created and closed when the segment is deleted; a DVP also gets opened/closed when you do a refresh.

@jmazanec15
Member Author

But if that turns out not to be true, we can handle the case easily via refCount

Right, refCounting would prevent issues, but it would make removal a little trickier.
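
To make the trade-off concrete, here is a minimal sketch of the ref-counting idea under discussion; the class is hypothetical, and the free callback stands in for the actual JNIService.free call:

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper around a loaded native index.
final class RefCountedIndexAllocation {
    private final long indexAddress; // native pointer from JNIService.loadIndex
    private final Runnable freeFn;   // e.g. () -> JNIService.free(indexAddress, knnEngine, isBinaryIndex)
    private final AtomicInteger refCount = new AtomicInteger(1); // the cache holds the initial reference

    RefCountedIndexAllocation(long indexAddress, Runnable freeFn) {
        this.indexAddress = indexAddress;
        this.freeFn = freeFn;
    }

    long getIndexAddress() {
        return indexAddress;
    }

    // A searcher takes a reference for the duration of a query.
    void incRef() {
        refCount.incrementAndGet();
    }

    // Native memory is freed only when the last reference is released, so a concurrent
    // search cannot touch freed memory if the producer closes mid-query.
    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            freeFn.run();
        }
    }
}

The removal trickiness is visible here: evicting the cache entry only drops the cache's reference, so the actual free may happen later, on whichever thread performs the final decRef.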

@kotwanikunal
Member

kotwanikunal commented Aug 12, 2024

Read through the issue here. refCount would be a solution, but it is not a direct solution to this problem.
Going through the code, it looks like we want the cache manager to process index lifecycle events.

My proposal would be to separate the concerns and instead follow this pattern to handle lifecycle events.
(The method names are hypothetical; we can see which events fit best for k-NN.)

public void onIndexModule(IndexModule module) {
    module.addSettingsUpdateConsumer(INDEX_KNN_ALGO_PARAM_EF_SEARCH_SETTING, newVal -> {
        logger.debug("The value of [KNN] setting [{}] changed to [{}]", KNN_ALGO_PARAM_EF_SEARCH, newVal);
        // TODO: replace cache-rebuild with index reload into the cache
        NativeMemoryCacheManager.getInstance().rebuildCache();
    });
    module.addIndexEventListener(new NativeCacheIndexEventListener());
}

static class NativeCacheIndexEventListener implements IndexEventListener {
    @Override
    public void beforeIndexShardClosed(ShardId shardId, IndexShard indexShard, Settings indexSettings) {
        NativeMemoryCacheManager.getInstance().shardClosed();
    }

    @Override
    public void shardRoutingChanged(IndexShard indexShard, ShardRouting oldRouting, ShardRouting newRouting) {
        NativeMemoryCacheManager.getInstance().shardMovedAway();
    }

    @Override
    public void afterIndexRemoved(Index index, IndexSettings indexSettings, IndicesClusterStateService.AllocatedIndices.IndexRemovalReason reason) {
        NativeMemoryCacheManager.getInstance().indexCleanup();
    }
}

Thoughts?

@jmazanec15
Member Author

@kotwanikunal My issue with the index lifecycle approach is that we would also need to keep the file watcher code, which I don't like because it's a direct dependency on the file system and directory implementation. If we can properly ref-count and free in the segment DocValuesProducer code, then we will be able to solve both issues in one go.

@kotwanikunal
Member

@kotwanikunal My issue with the index lifecycle approach is that we would also need to keep the file watcher code, which I don't like because it's a direct dependency on the file system and directory implementation. If we can properly ref-count and free in the segment DocValuesProducer code, then we will be able to solve both issues in one go.

I see the problem now. We have to track at the file level, not the shard level. I was wondering if we could simply filter cache entries on receiving shard lifecycle events and evict them (in a way, skipping the file watcher altogether), but the cache operates at a more granular level.

@luyuncheng
Collaborator

I see the problem now. We have to track at the file level, not the shard level. I was wondering if we could simply filter cache entries on receiving shard lifecycle events and evict them (in a way, skipping the file watcher altogether), but the cache operates at a more granular level.

@kotwanikunal when we have shard-close, file-remove, or segment-merge events, memory can be released at the producer level by calling close. And I think we need refCount, as #1885 (comment) and #1885 (comment) say: if a shard event happens before a segment query in a concurrent search scenario, refcounting helps avoid a segmentation fault.

I think a DocValuesProducer is a simple way, as the issue says:

We would need to implement a really lightweight DocValuesProducer that delegates everything.

@0ctopus13prime
Contributor

Before cutting out the absolute file path dependency in k-NN, #2033 should be resolved first.

@navneet1v navneet1v added the Enhancements Increases software capabilities beyond original client specifications label Sep 19, 2024
@0ctopus13prime
Contributor

After a deep dive, we can safely remove FileWatcher. The issue it was trying to solve has already been addressed.
Let me share a clean-up PR shortly.

@0ctopus13prime
Contributor

0ctopus13prime commented Sep 26, 2024

Removing FSDirectory Dependency in OpenSearch KNN

1. Goal

This document outlines the rationale for deprecating FileWatcher in OpenSearch k-NN and presents a cleanup plan. By providing background on the primary issues with FileWatcher and explaining how its original objective has already been addressed, we justify the decision to remove it.

2. Background

Issue : Move away from file watcher for releasing memory #1885

A project (#2033) is underway to introduce a loading and writing layer within the native engines, FAISS and NMSLIB. This layer provides a read stream that wraps Lucene's IndexInput and passes it to the native engines. The native engines can then read bytes via the provided read interface, effectively decoupling them from a direct dependency on FSDirectory, which relies on system file APIs such as fread and fwrite to read and write bytes.

However, even with this change, there is still one place that relies on an absolute file path: FileWatcher.
When IndexLoadStrategy loads a vector index, it first looks it up in a cache (e.g. NativeMemoryCacheManager). If the index is not present there, it calls the native engine API to load the vector index into memory and then puts it in the cache.
At the same time, it creates a FileWatcher to track the vector index file on the host's file system. The watcher obtains the absolute path of the vector index file after casting the directory to FSDirectory, then periodically monitors the file status. It evicts the cached vector index once it notices that the tracked index file has been deleted. Consequently, the allocated memory is automatically released during the eviction process.

Therefore, in order to entirely eliminate the FSDirectory dependency in OpenSearch k-NN, we need a better approach to cleaning up loaded vector indices.
Fortunately, we recently introduced a method of cleaning up vector index resources that makes FileWatcher unnecessary.

3. Problem Definitions

The current approach of using FileWatcher has two problems.

  1. FSDirectory dependency
    This prevents the system from using general Directory implementations in k-NN. Specifically, the hard-coded cast to FSDirectory restricts the use of non-FSDirectory implementations in OpenSearch.
  2. Lazy memory deallocation
    Even when a vector index is no longer in use, such as after calling the _close API, the allocated memory remains held. It is only released when the user sends an index delete request that removes the physical index file from the host. This issue was previously reported in [BUG] KNN doesn't release memory when close index.

4. Solution - Free Lunch

Fortunately, luyuncheng has already done great work in #1946, which cleans up allocated vector indices after receiving a close notification from Lucene.
Lucene uses a reference counting mechanism, similar to std::shared_ptr, to clean up resources when the count reaches zero. In this process, Lucene calls the close method of the DocValuesProducer when it is no longer in use. luyuncheng's work hooks into this mechanism to clean up vector index resources from both the cache and memory.
As a result, we no longer need to rely on FileWatcher to clean resources from the cache: by the time it notices that the tracked file is gone, the corresponding entry will already have been removed from the cache.

Now, all we need to do is remove FileWatcher from NativeMemoryLoadStrategy.IndexAllocation. In the sketch below, lines marked with - are removed and lines marked with + are added:

@Override
public NativeMemoryAllocation.IndexAllocation load(NativeMemoryEntryContext.IndexEntryContext indexEntryContext)
    throws IOException {
-   final Path absoluteIndexPath = Paths.get(indexEntryContext.getKey());
    // Ex: _0_165_my_vector.faiss
    final String vectorIndexFileName = indexEntryContext.getKey();
    final KNNEngine knnEngine = KNNEngine.getEngineNameFromPath(vectorIndexFileName);

-   final FileWatcher fileWatcher = new FileWatcher(absoluteIndexPath);
-   fileWatcher.addListener(indexFileOnDeleteListener);
-   fileWatcher.init();

    final Directory directory = indexEntryContext.getDirectory();
    final long indexSize = directory.fileLength(vectorIndexFileName);

    try (IndexInput readStream = directory.openInput(vectorIndexFileName, IOContext.READONCE)) {
        IndexInputWithBuffer indexInputWithBuffer = new IndexInputWithBuffer(readStream);
        long indexAddress = JNIService.loadIndex(indexInputWithBuffer, indexEntryContext.getParameters(), knnEngine);

-       return createIndexAllocation(indexEntryContext, knnEngine, indexAddress, fileWatcher, indexSize, absoluteIndexPath);
+       return createIndexAllocation(indexEntryContext, knnEngine, indexAddress, indexSize, vectorIndexFileName);
    }
}

4.2.3. Lucene's Doc Value Reader Cycle

All doc value readers ultimately derive from the Codec. KNN990Codec is the default codec on the main branch of OpenSearch k-NN; its doc values format is KNNFormatFacade, which internally delegates to KNN80DocValuesFormat. In turn, KNN80DocValuesFormat produces KNN80DocValuesProducer, which is responsible for removing the tracked vector index entries from the cache.

KNN80DocValuesProducer

@Override
public void close() throws IOException {
    for (String path : indexPathMap.values()) {
        nativeMemoryCacheManager.invalidate(path); <------- Cleaning resources. 
    }
    delegate.close();
}

Lucene uses a reference counting mechanism to determine when a resource is safe to clean up. When the count reaches zero, the resource is no longer in use and can be safely closed. This mechanism is fundamentally the same as std::shared_ptr, which deallocates memory once it is no longer needed.
DocValuesProducer is one of the resources managed by this mechanism. As a result, its close method will eventually be called, triggering the eviction of the corresponding vector index. During this eviction, the memory is freed by invoking JNIService.free().
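
As a plain-Lucene illustration of that mechanism (not k-NN code), readers expose incRef/decRef, and releasing the last reference cascades down to the per-segment producers:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;

final class RefCountingExample {
    // Closing the last reference to a reader cascades to SegmentCoreReaders,
    // which closes each segment's DocValuesProducer.
    static void run(Directory directory) throws IOException {
        DirectoryReader reader = DirectoryReader.open(directory); // refCount starts at 1
        reader.incRef();     // e.g. an IndexSearcher takes a reference
        try {
            // ... run queries against the reader ...
        } finally {
            reader.decRef(); // release the searcher's reference
        }
        reader.close();      // count reaches zero; per-segment resources are closed
    }
}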


4.2.4. Twin of KNN80DocValuesFormat: KNN990DocValuesFormat and KNN990DocValuesProducer

However, we still need a modification in KNN80DocValuesFormat, which casts to FSDirectory to get an absolute file path. Since KNN80DocValuesFormat is used not only in KNN990Codec but also by a variety of other codecs, we unfortunately cannot change its logic directly.
Instead, I propose to copy KNN80DocValuesFormat to KNN990DocValuesFormat and change the constructor, which is the only place that casts to FSDirectory.

KNN990DocValuesProducer

public KNN990DocValuesProducer(DocValuesProducer delegate, SegmentReadState state) {
    this.delegate = delegate;
    this.state = state;
    this.nativeMemoryCacheManager = NativeMemoryCacheManager.getInstance();

-   Directory directory = state.directory;
-   // directory would be CompoundDirectory, we need get directory firstly and then unwrap
-   if (state.directory instanceof KNN80CompoundDirectory) {
-       directory = ((KNN80CompoundDirectory) state.directory).getDir();
-   }
-
-   Directory dir = FilterDirectory.unwrap(directory);
-   if (!(dir instanceof FSDirectory)) {
-       log.warn("{} can not casting to FSDirectory", directory);
-       return;
-   }
-   String directoryPath = ((FSDirectory) dir).getDirectory().toString();

    for (FieldInfo field : state.fieldInfos) {
        if (!field.attributes().containsKey(KNN_FIELD)) {
            continue;
        }
        // Only Native Engine put into indexPathMap
        KNNEngine knnEngine = getNativeKNNEngine(field);
        if (knnEngine == null) {
            continue;
        }
        List<String> engineFiles = KNNCodecUtil.getEngineFiles(knnEngine.getExtension(), field.name, state.segmentInfo);
-       Path indexPath = PathUtils.get(directoryPath, engineFiles.get(0));
-       indexPathMap.putIfAbsent(field.getName(), indexPath.toString());

+       // Ex: _0_165_my_vector.faiss
+       // See KNNCodecUtil.buildEngineFileName for more information.
+       final String indexFileName = engineFiles.get(0);
+       indexPathMap.putIfAbsent(field.getName(), indexFileName);
    }
}
!! Other parts will be identical to KNN80DocValuesProducer !!

KNN990DocValuesFormat

public class KNN990DocValuesFormat extends DocValuesFormat {
    !! Other parts will be identical to KNN80DocValuesFormat !!

    @Override
    public DocValuesProducer fieldsProducer(SegmentReadState state) throws IOException {
        return new KNN990DocValuesProducer(delegate.fieldsProducer(state), state);
    }
}

KNNCodecVersion.V_9_9_0

V_9_9_0(
    "KNN990Codec",
    new Lucene99Codec(),
    new KNN990PerFieldKnnVectorsFormat(Optional.empty()),
    (delegate) -> new KNNFormatFacade(
-       new KNN80DocValuesFormat(delegate.docValuesFormat()),
+       new KNN990DocValuesFormat(delegate.docValuesFormat()),
        new KNN80CompoundFormat(delegate.compoundFormat())
    ),
    (userCodec, mapperService) -> KNN990Codec.builder()
        .delegate(userCodec)
        .knnVectorsFormat(new KNN990PerFieldKnnVectorsFormat(Optional.ofNullable(mapperService)))
        .build(),
    KNN990Codec::new
);

4.2.5. KNNWeight

Now we can remove the FSDirectory dependency from KNNWeight entirely. Previously, KNNWeight cast to FSDirectory to obtain an absolute path to use as a cache key. With this change, the vector index file name alone is sufficient to retrieve or load the vector index from the cache.


private Map<Integer, Float> doANNSearch(
    final LeafReaderContext context,
    final BitSet filterIdsBitSet,
    final int cardinality,
    final int k
) throws IOException {
-   String directory = ((FSDirectory) FilterDirectory.unwrap(reader.directory())).getDirectory().toString();
    ...
-   Path indexPath = PathUtils.get(directory, engineFiles.get(0));
    final String indexFileName = engineFiles.get(0);
    final KNNQueryResult[] results;
    KNNCounter.GRAPH_QUERY_REQUESTS.increment();

    // We need to first get index allocation
    NativeMemoryAllocation indexAllocation;
    try {
        indexAllocation = nativeMemoryCacheManager.get(
            new NativeMemoryEntryContext.IndexEntryContext(
                reader.directory(),
-               indexPath.toString(),
+               indexFileName,
                NativeMemoryLoadStrategy.IndexLoadStrategy.getInstance(),
                getParametersAtLoading(
                    spaceType,
                    knnEngine,
                    ...
    ...
}

4.2.6. Other Miscellaneous

  • NativeMemoryLoadStrategy
    • Remove FileWatcher
    • Remove WatcherHandle
  • NativeMemoryAllocation.IndexAllocation
    • Remove WatcherHandle

4.2.7. Pros / Cons

4.2.7.1. Pros

  1. It could not be simpler than this.
  2. No side effects; the problem was already solved.

4.2.7.2. Cons

None. This is the rare, ideal scenario in the IT world, a free lunch: the problem we are trying to solve has already been addressed by somebody else. All we need to do now is remove the redundant parts.

5. Demo


5.1. Stdout Print

I added System.out.println logging in three places.

5.1.1. KNN990DocValuesProducer

Here, it logs two things:

  1. All entries in the cache.
  2. The cache keys (indexPathMap).

This prints the singleton cache's status after the close method is called.


@Override
public void close() throws IOException {
    for (String path : indexPathMap.values()) {
        nativeMemoryCacheManager.invalidate(path);
    }
    
    // TMP
    synchronized (Object.class) {
        System.out.println("[KDY] KNN990DocValuesProducer ==============================================");
        System.out.println("[KDY] Cache manager ==============================");
        System.out.println("[KDY] " + nativeMemoryCacheManager.kdyGetKeys());
        System.out.println("[KDY] indexPathMap ==============================");
        System.out.println("[KDY] " + indexPathMap);
    }
    // TMP

    delegate.close();
}

5.1.2. NativeMemoryAllocation.IndexAllocation.load(NativeMemoryEntryContext.IndexEntryContext indexEntryContext)

This is called whenever the vector index is not already present in the cache, just before the vector index is loaded.


@Override
public NativeMemoryAllocation.IndexAllocation load(NativeMemoryEntryContext.IndexEntryContext indexEntryContext)
    throws IOException {
    // final Path absoluteIndexPath = Paths.get(indexEntryContext.getKey());
    final String vectorIndexFileName = indexEntryContext.getKey();
    final KNNEngine knnEngine = KNNEngine.getEngineNameFromPath(vectorIndexFileName);

    // TMP
    synchronized (Object.class) {
        System.out.println("[KDY] ^^^ NativeMemoryLoadStrategy::IndexLoadStrategy ============================");
        System.out.println("[KDY] Cache key : " + vectorIndexFileName);
        System.out.println("[KDY] KNN engine: " + knnEngine);
        System.out.println("[KDY] Cache contents : " + NativeMemoryCacheManager.getInstance().kdyGetKeys());
        System.out.println("[KDY] $$$ NativeMemoryLoadStrategy::IndexLoadStrategy ============================");
    }
    // TMP
    
    ...

5.1.3. JNIService.free

This logs right before the vector index is deallocated.


public static void free(final long indexPointer, final KNNEngine knnEngine, final boolean isBinaryIndex) {
    // TMP
    synchronized (Object.class) {
        System.out.println("[KDY] ^^^ JNIService::free ============================");
        System.out.println("[KDY] knnEngine: " + knnEngine.getName());
        System.out.println("[KDY] $$$ JNIService::free ============================");
    }
    // TMP
    ...

5.2. Demo 1. Closing Index

1. Create a new schema.

curl -X PUT 'http://localhost:9200/knn-index/' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2,
        "method": {
          "engine": "faiss",
          "name": "hnsw"
        }
      }
    },
    "dynamic": false
  }
}
' | jq .

2. Bulk indexing.

curl -X POST 'http://localhost:9200/_bulk' -H 'Content-Type: application/json' -d '{ "index": { "_index": "knn-index", "_id": "1" } }
   { "my_vector": [1.5, 2.5], "price": 12.2 }
   { "index": { "_index": "knn-index", "_id": "2" } }
   { "my_vector": [2.5, 3.5], "price": 7.1 }
   { "index": { "_index": "knn-index", "_id": "3" } }
   { "my_vector": [3.5, 4.5], "price": 12.9 }
   { "index": { "_index": "knn-index", "_id": "4" } }
   { "my_vector": [5.5, 6.5], "price": 1.2 }
   { "index": { "_index": "knn-index", "_id": "5" } }
   { "my_vector": [4.5, 5.5], "price": 3.7 }
' | jq .

3. Flush

curl -X POST http://localhost:9200/knn-index/_flush | jq .

# check whether a vector index was created.
find data
./nodes/0/indices/QNt7UpoET0-cahPz9bw1AQ/0/index/_0_165_my_vector.faissc

4. Query

curl -X GET 'http://localhost:9200/knn-index/_search' -H 'Content-Type: application/json' -d '{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3],
        "k": 2
      }
    }
  }
}' | jq .

Logs


We loaded a vector index.

[KDY] ^^^ NativeMemoryLoadStrategy::IndexLoadStrategy ============================
Cache key : _0_165_my_vector.faissc
KNN engine: FAISS
Cache contents : {}
[KDY] $$$ NativeMemoryLoadStrategy::IndexLoadStrategy ============================

5. Close index

curl -X POST http://localhost:9200/knn-index/_close | jq .

Logs

// DocValuesProducer.close() was called.

[KDY] KNN990DocValuesProducer ==============================================
[KDY] ^^^ Cache manager ==============================
{} <------------- Cache is empty!
[KDY] indexPathMap ==============================
{my_vector=_0_165_my_vector.faissc} <------ We're not using the full path anymore.

// Freeing memory.

[KDY] ^^^ JNIService::free ============================
knnEngine: faiss
[KDY] $$$ JNIService::free ============================

5.3. Demo 2. Merge Two Indices

1. Create a new schema

curl -X PUT 'http://localhost:9200/knn-index/' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100,
      "use_compound_file": false
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2,
        "method": {
          "engine": "faiss",
          "name": "hnsw"
        }
      }
    },
    "dynamic": false
  }
}
' | jq .

2. Bulk Indexing

curl -X POST 'http://localhost:9200/_bulk' -H 'Content-Type: application/json' -d '{ "index": { "_index": "knn-index", "_id": "1" } }
   { "my_vector": [1.5, 2.5], "price": 12.2 }
   { "index": { "_index": "knn-index", "_id": "2" } }
   { "my_vector": [2.5, 3.5], "price": 7.1 }
   { "index": { "_index": "knn-index", "_id": "3" } }
   { "my_vector": [3.5, 4.5], "price": 12.9 }
   { "index": { "_index": "knn-index", "_id": "4" } }
   { "my_vector": [5.5, 6.5], "price": 1.2 }
   { "index": { "_index": "knn-index", "_id": "5" } }
   { "my_vector": [4.5, 5.5], "price": 3.7 }
' | jq .

3. Flush

curl -X POST http://localhost:9200/knn-index/_flush | jq .

Index file : ./nodes/0/indices/WdehW34KSIOSg5mJGxtYuw/0/index/_0_165_my_vector.faissc

4. Query

curl -X GET 'http://localhost:9200/knn-index/_search' -H 'Content-Type: application/json' -d '{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3],
        "k": 2
      }
    }
  }
}' | jq .

Logs

[KDY] ^^^ NativeMemoryLoadStrategy::IndexLoadStrategy ============================
[KDY] Cache key : _0_165_my_vector.faissc
[KDY] KNN engine: FAISS
[KDY] Cache contents : []
[KDY] $$$ NativeMemoryLoadStrategy::IndexLoadStrategy ============================

5. Make another segment

Repeat steps 2, 3, and 4.

Logs


// Closing the first segment. It was already merged, no longer in use.

[KDY] KNN990DocValuesProducer ==============================================
[KDY] Cache manager ==============================
[KDY] []
[KDY] indexPathMap ==============================
[KDY] {my_vector=_0_165_my_vector.faissc} <--- the first segment.


// Freeing the first one.
[KDY] ^^^ JNIService::free ============================
[KDY] knnEngine: faiss
[KDY] $$$ JNIService::free ============================


// Loading the merged one
[KDY] ^^^ NativeMemoryLoadStrategy::IndexLoadStrategy ============================
[KDY] Cache key : _1_165_my_vector.faissc
[KDY] KNN engine: FAISS
[KDY] Cache contents : [] <---------- cache is empty now, we will put the merged one.
[KDY] $$$ NativeMemoryLoadStrategy::IndexLoadStrategy ============================

6. Close index

curl -X POST http://localhost:9200/knn-index/_close | jq .

Logs

// DocValuesProducer.close was called!

[KDY] KNN990DocValuesProducer ==============================================
[KDY] Cache manager ==============================
[KDY] [] <---------- After closing, there is no entry in the cache.
[KDY] indexPathMap ==============================
[KDY] {my_vector=_1_165_my_vector.faissc} <------- the merged one.

@navneet1v
Collaborator

navneet1v commented Sep 27, 2024

@0ctopus13prime thanks for writing up the idea of removing the file watcher. I like the proposal presented here, except for this:

Instead, I propose to copy KNN80DocValuesFormat to KNN990DocValuesFormat and change the constructor, which is the only place that casts to FSDirectory.

Even after the completion of this proposal, we will still be left with the FSDirectory dependency. Is this something that will be removed when you fix the write path, or will this dependency still be there?

Another point: with the 2.17 version of OpenSearch we moved from storing vectors as BinaryDocValues to FloatVectorValues/ByteVectorValues, so if we really want to remove the FileWatcher from the code, we also have to fix the NativeEngines990KnnVectorsFormat reader so that when the readers are closed we can remove the graphs from the cache.

Another thing: we should also stop using the filePath as the cache key and move to a better key that we can generate in the DVProducers and KNNVectorFormatReaders per vector field, to ensure we can remove graphs easily without taking any dependency on FSDirectory. (A sketch of both ideas follows below.)
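
For illustration, a sketch of both points combined, a delegating KnnVectorsReader that evicts per-field native graphs by a generated logical key on close, might look like the following. The class name, the delegate wiring, and the cacheKeyPerField map are assumptions rather than the plugin's actual API; the Lucene 9.x KnnVectorsReader methods are delegated unchanged:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;
import org.opensearch.knn.index.memory.NativeMemoryCacheManager;

final class EvictingKnnVectorsReader extends KnnVectorsReader {
    private final KnnVectorsReader delegate;
    private final Map<String, String> cacheKeyPerField; // field name -> logical cache key

    EvictingKnnVectorsReader(KnnVectorsReader delegate, Map<String, String> cacheKeyPerField) {
        this.delegate = delegate;
        this.cacheKeyPerField = cacheKeyPerField;
    }

    @Override
    public void checkIntegrity() throws IOException {
        delegate.checkIntegrity();
    }

    @Override
    public FloatVectorValues getFloatVectorValues(String field) throws IOException {
        return delegate.getFloatVectorValues(field);
    }

    @Override
    public ByteVectorValues getByteVectorValues(String field) throws IOException {
        return delegate.getByteVectorValues(field);
    }

    @Override
    public void search(String field, float[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException {
        delegate.search(field, target, knnCollector, acceptDocs);
    }

    @Override
    public void search(String field, byte[] target, KnnCollector knnCollector, Bits acceptDocs) throws IOException {
        delegate.search(field, target, knnCollector, acceptDocs);
    }

    @Override
    public long ramBytesUsed() {
        return delegate.ramBytesUsed();
    }

    // Evict the per-field native graphs before closing the delegate.
    @Override
    public void close() throws IOException {
        for (String cacheKey : cacheKeyPerField.values()) {
            NativeMemoryCacheManager.getInstance().invalidate(cacheKey);
        }
        delegate.close();
    }
}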

@vamshin vamshin added v2.18.0 Roadmap:Vector Database/GenAI Project-wide roadmap label labels Sep 27, 2024
@0ctopus13prime
Contributor

@navneet1v
Sounds good. I think we can take a similar approach to clean up resources in the reader as well. I will follow up on this ticket.

As for "we should also remove the filePath as key": in the proposal a unique logical file name is used, so I expect that in the upcoming changes there won't be a full absolute file path used as a cache key.

My to-do list:

  1. Modify KNN80 rather than copying it.
  2. Make the same change in NativeEngines990KnnVectorsFormat.

After these action items, there won't be any FSDirectory dependencies left.

@jmazanec15
Member Author

@0ctopus13prime I think the overall proposal looks good. If you have code, could you raise a draft PR so I can take a closer look?

@0ctopus13prime
Contributor

@jmazanec15
Sure. After this PR is merged, I will raise it shortly.

@0ctopus13prime
Contributor

@jmazanec15
Raised a PR for code sharing; it's not yet ready for production. I will add tests to it.
Thank you.
