Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented AnnoyIndex serialization to bytes objects in-memory #661

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

FuexFollets
Copy link

@FuexFollets FuexFollets commented Jan 25, 2024

Despite the fact that AnnoyIndex had methods to save and load the state of an index to a file, there was not any method to serialize the state of the index without saving to a file. This is important because you may want to use an AnnoyIndex object as a field in another class, and also be able to efficiently serialize the class that contains the index. The main benefit of this is that some serialization libraries may be able to serialize the aggregate class more efficiently in terms of size. This also avoids other complications such as file write permissions and searching for an adequate temp directiry and makes it easier to serialize the aggregate overall.

This pull request includes the following:

  • AnnoyIndex serialize and deserialize functions. The serialize method returns bytes which can be passed into the deserialize method to then 'load' the bytes to create the original serialized index.
  • Bindings for serialize and deserialize functions implemented for python, golang, and lua
  • Tests for the functionality and bindings in python, golang, and lua

So far, I have gotten all tests to pass in addition to the tests which I have implemented in each language. The tests that I have implemented simply test the result of get_nns_by_item before and after serializing and deserializing an index. Unless there is anything to change that I have overseen, it should be ready.

@erikbern
Copy link
Collaborator

Thanks for the contribution! Since Annoy stores all data in contiguous memory, wouldn't it be enough to just copy that memory buffer and then load from that? I'm not sure if we need to reimplement the logic to find the roots.

Also what happens if you serialize an index that has mmapped a file? We should probably disallow that right?

@FuexFollets
Copy link
Author

@erikbern I have changed some of the code based on your suggestion to instead store the roots and other information in the serialized bytes so it does not need to be recalculated. I tested if serialization will still work properly even if the index has been loaded from a file by mmapping, and it seems to still work just fine based on the fact that the serialization result is the same. The main logic that is being implemented here is putting together the buffers that are pointed to within the AnnoyIndex class all into a single contiguous byte sequence which is later turned back into an index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants