[py-tx] Implement a new cleaner PDQ index solution from scratch #1613
Comments
I will start a fix on this issue at the Hackathon
@Dcallies I plan to take this issue after I finish with the pdq rotation. Could you help me divide this issue into smaller sub problems for me to work on when you have time?
Yes - do you need them as issues or just to write them out? Steps:
@Dcallies do you want me to start swapping out the index class that the PDQ signal type uses by default?
Not quite yet, we're missing the optimized solution: faiss IVF. Faiss IVF indices should be used when the number of hashes is above some threshold (e.g. 1,000 hashes), and the selection should be implemented in build() based on the initial input size.
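A minimal sketch of the size-based selection described in this comment, assuming the choice is expressed as a faiss binary index factory string ("BFlat" for brute force, "BIVF<nlist>" for IVF). The helper name, the square-root heuristic for the number of IVF cells, and the exact cutoff are illustrative assumptions, not the project's implementation.

```python
import math

# Assumed cutoff, taken from the "e.g. 1,000 hashes" in the discussion.
IVF_THRESHOLD = 1_000


def choose_binary_factory_string(num_hashes: int, threshold: int = IVF_THRESHOLD) -> str:
    """Hypothetical helper: pick a faiss binary index description from input size.

    At or below the threshold a flat (exhaustive) index is cheap enough;
    above it, an IVF index partitions the hashes into cells.
    """
    if num_hashes <= threshold:
        return "BFlat"
    # Assumption: roughly sqrt(n) cells, a common rule of thumb for IVF.
    nlist = max(1, math.isqrt(num_hashes))
    return f"BIVF{nlist}"
```

Inside build(), the returned string could be passed to `faiss.index_binary_factory(256, description)` (PDQ hashes are 256 bits). An IVF index would additionally need `train()` to be called before `add()`.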
When we built the PDQ index, it was our first attempt, and we made a lot of strange/bad choices.
Namely:
I think we could provide a second implementation that is a lot simpler, which we could then find a way to swap in.
The key elements:
Pass in the index type as an argument during construction
Simplify the stored state of the index implementation
Use a simpler inner wrapper to handle some of the PDQ details
class _PDQHashIndex:
    """
    A wrapper around the faiss index for pickle serialization
    """
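One way such a wrapper could make a faiss index picklable is sketched below, assuming the index is a faiss binary index and that `faiss.serialize_index_binary` / `faiss.deserialize_index_binary` are used for the round trip. The class name and the two hook methods are hypothetical; the hooks exist only so that the serialization step can be swapped out or tested without faiss installed.

```python
class PickleableBinaryIndex:
    """Hypothetical wrapper that pickles a faiss binary index as a byte array."""

    def __init__(self, faiss_index):
        self.faiss_index = faiss_index

    @staticmethod
    def _serialize(faiss_index):
        # Imported lazily so the class can be defined without faiss present.
        import faiss

        # Returns a numpy uint8 array, which pickle can handle directly.
        return faiss.serialize_index_binary(faiss_index)

    @staticmethod
    def _deserialize(data):
        import faiss

        return faiss.deserialize_index_binary(data)

    def __getstate__(self):
        # Only the serialized index is stored; the live faiss object is not.
        return {"faiss_blob": self._serialize(self.faiss_index)}

    def __setstate__(self, state):
        # Rebuild the live faiss object from the stored array.
        self.faiss_index = self._deserialize(state["faiss_blob"])
```

With faiss installed, `pickle.loads(pickle.dumps(PickleableBinaryIndex(faiss.IndexBinaryFlat(256))))` should yield a wrapper around an equivalent index.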
Putting it together with search
Dynamically selecting lookup type from build function
Test everything
Add a robust set of unit tests for this functionality
Rollout plan
After we confirm that everything is working as expected, we'll swap out the index class that the PDQ signal type uses by default. I think we can get away without a major version bump for this.