Add HNSW vector storage #18

Collaborator

@NumberChiffre NumberChiffre commented Aug 28, 2024

Description

TL;DR: fast querying at the cost of slow insertion. This uses hnswlib, which is faster than the HNSW implementation in faiss.

Updates

  • Added HNSWVectorDBStorage.
  • Added basic unit tests.
  • Added basic benchmarking against NanoVectorDBStorage.
  • Added a GraphRAG example under examples (it easily triggers the rate limit, but that is unrelated to this PR).
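
For readers who want to see how the three reported numbers (insert, save, average query time) could be measured, here is a minimal sketch. It is not the benchmark script from this PR: `BruteForceStore` and `benchmark` are hypothetical names, and the store is a tiny exact cosine search written only to make the timing harness runnable.

```python
import json
import math
import os
import tempfile
import time


class BruteForceStore:
    """Hypothetical stand-in for a vector storage: exact cosine search over a list."""

    def __init__(self):
        self.vectors = []

    def insert(self, vectors):
        self.vectors.extend(vectors)

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.vectors, f)

    def query(self, q, top_k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        # Rank every stored vector by similarity to q, best first.
        ranked = sorted(enumerate(self.vectors), key=lambda iv: cosine(q, iv[1]), reverse=True)
        return [i for i, _ in ranked[:top_k]]


def benchmark(store, vectors, queries, path):
    """Return (insert seconds, save seconds, average seconds per query)."""
    t0 = time.perf_counter()
    store.insert(vectors)
    insert_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    store.save(path)
    save_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in queries:
        store.query(q)
    avg_query_s = (time.perf_counter() - t0) / len(queries)
    return insert_s, save_s, avg_query_s


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
store = BruteForceStore()
with tempfile.TemporaryDirectory() as tmp:
    insert_s, save_s, avg_query_s = benchmark(
        store, vectors, [[1.0, 0.0]], os.path.join(tmp, "vdb.json")
    )
print(f"Insert: {insert_s:.2f}s, Save: {save_s:.2f}s, Avg Query: {avg_query_s:.4f}s")
```

Any storage object exposing the same three methods could be passed to `benchmark`, which keeps the comparison between two backends on identical inputs.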

Benchmark results

> python benchmarks/hnsw_vs_nano_vector_storage.py 
Running NanoVectorDB benchmark...
INFO:nano-graphrag:Creating working directory ./nano_graphrag_cache_benchmark_hnsw_vs_nano_vector_storage
INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1024, 'metric': 'cosine', 'storage_file': './nano_graphrag_cache_benchmark_hnsw_vs_nano_vector_storage/vdb_entities.json'} 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1024, 'metric': 'cosine', 'storage_file': './nano_graphrag_cache_benchmark_hnsw_vs_nano_vector_storage/vdb_benchmark_nano.json'} 0 data
Benchmarking nano...
nano Benchmark:   0%|          | 0/100000 [00:00<?, ?it/s]
INFO:nano-graphrag:Inserting 100000 vectors to benchmark_nano
nano Benchmark: 100101it [00:04, 22470.39it/s]
nano - Insert: 1.44s, Save: 1.81s, Avg Query: 0.0120s

Running HNSWVectorStorage benchmark...
INFO:nano-graphrag:Load KV full_docs with 0 data
INFO:nano-graphrag:Load KV text_chunks with 0 data
INFO:nano-graphrag:Load KV llm_response_cache with 0 data
INFO:nano-graphrag:Load KV community_reports with 0 data
INFO:nano-vectordb:Init {'embedding_dim': 1024, 'metric': 'cosine', 'storage_file': './nano_graphrag_cache_benchmark_hnsw_vs_nano_vector_storage/vdb_entities.json'} 0 data
INFO:nano-graphrag:Created new index for benchmark_hnsw
Benchmarking hnsw...
hnsw Benchmark:   0%|          | 0/100000 [00:00<?, ?it/s]
INFO:nano-graphrag:Inserting 100000 vectors to benchmark_hnsw
hnsw Benchmark: 100101it [01:24, 1187.70it/s]
hnsw - Insert: 83.67s, Save: 0.44s, Avg Query: 0.0017s

Benchmark Results:
NanoVectorDB - Insert: 1.44s, Save: 1.81s, Avg Query: 0.0120s
HNSWVectorStorage - Insert: 83.67s, Save: 0.44s, Avg Query: 0.0017s
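
To put the trade-off in relative terms, the ratios can be worked out directly from the two summary lines. This is only arithmetic on the numbers reported in this run; the dictionary names are illustrative.

```python
# Timings in seconds, copied from the benchmark summary in this description.
nano = {"insert": 1.44, "save": 1.81, "avg_query": 0.0120}
hnsw = {"insert": 83.67, "save": 0.44, "avg_query": 0.0017}

query_speedup = nano["avg_query"] / hnsw["avg_query"]
insert_slowdown = hnsw["insert"] / nano["insert"]
save_speedup = nano["save"] / hnsw["save"]

print(f"HNSW query:  {query_speedup:.1f}x faster")    # 7.1x faster
print(f"HNSW insert: {insert_slowdown:.1f}x slower")  # 58.1x slower
print(f"HNSW save:   {save_speedup:.1f}x faster")     # 4.1x faster
```

On this single run with 100000 vectors of dimension 1024, queries are roughly 7x faster and inserts roughly 58x slower.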

@gusye1234
Owner

Great work, and thank you!
I'm curious, does hnswlib support Windows?

@NumberChiffre
Collaborator Author

NumberChiffre commented Aug 28, 2024

No access to Windows on my part, but it should work:
https://github.com/nmslib/hnswlib?tab=readme-ov-file#for-developers

Can confirm this works on M3 Mac and worked on Ubuntu 20.04.

Why don't we containerize with Docker and docker-compose, and even use poetry?

@gusye1234
Owner

Cool! A few responses:

  • I don't think adding a Docker image for nano-graphrag is urgent right now. This project should stay nano and lightweight, not become something that must run in a container.
  • Is poetry a better option for dependency management? I don't know much about this. Maybe we should move to poetry.

@NumberChiffre
Collaborator Author

NumberChiffre commented Aug 28, 2024

Sounds good!

poetry works well with Docker and generally handles complex dependency trees more effectively than pip.

I'm done with commits for this PR; let me know what you think 👍

@gusye1234
Owner

Cool, I have only one review comment on this PR; maybe you can take a look at it.
Then we're all set.

Owner

@gusye1234 gusye1234 left a comment

LGTM

@@ -151,6 +152,7 @@ def __post_init__(self):
global_config=asdict(self),
embedding_func=self.embedding_func,
meta_fields={"entity_name"},
**self.vector_db_storage_cls_kwargs,
Owner

Why unpack vector_db_storage_cls_kwargs here?
HNSWStorage already gets this dict from global_config.
Unpacking vector_db_storage_cls_kwargs here may break the initialization of other vector storages.
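
The concern can be illustrated with a small sketch. The classes and the `ef_search` option below are hypothetical stand-ins, not the actual nano-graphrag storage classes; the point is only that a storage reading its options from `global_config` works everywhere, while unpacking the same dict into a constructor that does not declare those keywords raises a `TypeError`.

```python
from dataclasses import dataclass, field


@dataclass
class BaseStorage:
    namespace: str
    global_config: dict


@dataclass
class PlainStorage(BaseStorage):
    """Hypothetical storage that declares no extra options."""


@dataclass
class TunableStorage(BaseStorage):
    """Hypothetical storage that reads its options from global_config."""

    ef_search: int = field(init=False)  # hypothetical option name

    def __post_init__(self):
        kwargs = self.global_config.get("vector_db_storage_cls_kwargs", {})
        self.ef_search = kwargs.get("ef_search", 50)


config = {"vector_db_storage_cls_kwargs": {"ef_search": 100}}

# Reading from global_config works for both classes.
plain = PlainStorage(namespace="entities", global_config=config)
tunable = TunableStorage(namespace="entities", global_config=config)
print(tunable.ef_search)

# Unpacking the same kwargs into a class that does not declare them fails.
unpack_error = None
try:
    PlainStorage(
        namespace="entities",
        global_config=config,
        **config["vector_db_storage_cls_kwargs"],
    )
except TypeError as exc:
    unpack_error = str(exc)
print(unpack_error)
```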

Owner

It's fixed by 20219d0

@zhaofangtao

The library "hnswlib" can be installed and used in the Windows 11 environment.

@gusye1234 gusye1234 merged commit c5dd0d8 into gusye1234:main Aug 28, 2024
1 check passed