-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented AnnoyIndex serialization to bytes objects in-memory #661
base: main
Are you sure you want to change the base?
Conversation
Thanks for the contribution! Since Annoy stores all data in contiguous memory, wouldn't it be enough to just copy that memory buffer and then load from that? I'm not sure if we need to reimplement the logic to find the roots. Also what happens if you serialize an index that has mmapped a file? We should probably disallow that right? |
@erikbern I have changed some of the code based on your suggestion to instead store the roots and other information in the serialized bytes so it does not need to be recalculated. I tested if serialization will still work properly even if the index has been loaded from a file by mmapping, and it seems to still work just fine based on the fact that the serialization result is the same. The main logic that is being implemented here is putting together the buffers that are pointed to within the AnnoyIndex class all into a single contiguous byte sequence which is later turned back into an index. |
Despite the fact that AnnoyIndex had methods to save and load the state of an index to a file, there was not any method to serialize the state of the index without saving to a file. This is important because you may want to use an AnnoyIndex object as a field in another class, and also be able to efficiently serialize the class that contains the index. The main benefit of this is that some serialization libraries may be able to serialize the aggregate class more efficiently in terms of size. This also avoids other complications such as file write permissions and searching for an adequate temp directiry and makes it easier to serialize the aggregate overall.
This pull request includes the following:
So far, I have gotten all tests to pass in addition to the tests which I have implemented in each language. The tests that I have implemented simply test the result of
get_nns_by_item
before and after serializing and deserializing an index. Unless there is anything to change that I have overseen, it should be ready.