
[pytx] Implement a new cleaner PDQ index solution #1698

Merged
merged 8 commits into from
Nov 25, 2024

Conversation

@haianhng31 (Contributor) commented Nov 14, 2024

Summary

Resolves issue #1613 .
This PR introduces a new PDQIndex2 class for managing and querying a PDQ hash-based index using FAISS. Key changes include:

  • _PDQHashIndex: A wrapper around the FAISS index to handle adding hashes and performing searches using the FAISS library.
  • PDQIndex2: A class for managing a PDQ index, which includes:
    • The ability to add PDQ hashes and associate them with entries.
    • A query method for searching the index for matching entries based on a query hash.
    • A serialize and deserialize method for persisting and loading the index from binary streams using pickle.
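Before anything reaches FAISS, each 64-character PDQ hex string has to be expanded into its 256-bit vector. The real helper in this PR is _convert_pdq_strings_to_ndarray (which returns a numpy array); the stdlib-only function below is an illustrative sketch of the same conversion, not the PR's code:

```python
def pdq_hex_to_bits(pdq_hex: str) -> list:
    """Expand a 64-char PDQ hex hash into its 256-bit vector (MSB first).

    Illustrative stand-in for _convert_pdq_strings_to_ndarray, which does
    this for a batch of hashes and returns a numpy array for FAISS.
    """
    value = int(pdq_hex, 16)
    return [(value >> (255 - pos)) & 1 for pos in range(256)]
```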

Test Plan

I have included several test cases for this, currently all in one file.

For the bug in issue #1318, I have created 2 test cases covering it.

I also ran test_pdq_index with 1,000 banked hashes & 10,000 queries in the unit test. The test passed, but it took 4 minutes to run. When I tried 10,000 banked hashes & 100,000 queries, the test took far too long to finish.
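The correctness checks in these tests compare the index's answers against an exhaustive baseline. A stdlib-only sketch of such a brute-force baseline is below; the function names and the PDQ match threshold of 31 bits are illustrative assumptions here, not code from the PR:

```python
def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Bit-level hamming distance between two equal-length hex hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def brute_force_matches(banked, queries, threshold=31):
    """For each query hash, list every banked hash within `threshold` bits.

    O(len(banked) * len(queries)) - slow, but the index must agree with it
    exactly for the index to be considered correct.
    """
    return {
        q: [b for b in banked if hamming_distance(q, b) <= threshold]
        for q in queries
    }
```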

@Dcallies Dcallies changed the title [pytx] Implement a new cleaner PDQ index solution (2) [pytx] Implement a new cleaner PDQ index solution Nov 16, 2024
@Dcallies (Contributor) left a comment

Looking good so far!

Toplevel comments:

  1. I am not convinced yet that this indexing solution is correct. To help convince me, you can write a unittest that generates two large sets of random hashes (PDQSignal.get_random_signal()), b and q (e.g. 1,000 and 10,000), uses the brute-force method to check exactly which hashes in b should match the hashes in q, loads the b hashes into your indexing solution, and then queries all of the q hashes, making sure you get exactly the expected results.
  2. Many of your existing tests are tautological, or should check that the matching hashes are actually the ones that should match. You can likely replace many of your existing tests with variations on the test in (1) above - I put some ideas for you in the IndexRecord class, but there are others that might work.
  3. I recommend adding a method that can return a random hash exactly X bits different from a given hash, and then using that to help in your tests.
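The recommended helper might look like the sketch below; flip_random_bits is a hypothetical name introduced here for illustration, not an existing API in the project:

```python
import random


def flip_random_bits(hex_hash, distance, rng=None):
    """Return a hash exactly `distance` bits away from `hex_hash`.

    Flips `distance` distinct, randomly chosen bit positions, so the
    hamming distance of the result is exactly `distance`.
    (Hypothetical helper sketched for the review suggestion above.)
    """
    rng = rng or random.Random()
    n_bits = len(hex_hash) * 4
    value = int(hex_hash, 16)
    for pos in rng.sample(range(n_bits), distance):  # distinct positions
        value ^= 1 << pos
    return format(value, "0{}x".format(len(hex_hash)))
```

In tests, this makes it easy to construct near-matches at a chosen distance (e.g. just inside or just outside the match threshold) instead of XOR-ing fixed constants by hand.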

Comment on lines 119 to 126
def add(self, pdq_strings: t.Sequence[str]) -> None:
    """
    Add PDQ hashes to the FAISS index.

    Args:
        pdq_strings (Sequence[str]): PDQ hash strings to add
    """
    vectors = self._convert_pdq_strings_to_ndarray(pdq_strings)
    self.faiss_index.add(vectors)
Contributor

nit: It's somewhat confusing that this is add which takes multiple here, but the other is add which takes singular.

ignorable: I think you can just move this functionality to PDQIndex2

Comment on lines 134 to 136
# Create a near-match by flipping a few bits
base_hash = pdq_hashes[0]
near_hash = hex(int(base_hash, 16) ^ 0xF)[2:].zfill(64) # Flip 4 bits
Contributor

blocking: Did you forget to query near_hash?

ignorable: In another similar project, I wrote a helper on the signal type itself to provide a random hash that was X distance away from a given hash by flipping random bits.

Contributor Author

I query the near_hash on the line below.

Comment on lines 42 to 53
pdq_hashes = [
    "f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22",
    "f" * 64,
    "0" * 64,
    "a" * 64,
]
index = PDQIndex2[str](
    entries=[
        ("f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22", 0)
    ]
)
return index, pdq_hashes
Contributor

blocking: Why do you return hashes which aren't in the index?

Contributor Author
So that I can use the other hashes to test for non-matches when needed. For example, in the test test_one_entry_sample_index.

@haianhng31 haianhng31 requested a review from Dcallies November 18, 2024 19:59
@Dcallies (Contributor) left a comment

Nice work! We're nearly there, only a few more blocking comments!

Comment on lines 54 to 56
index = PDQIndex2()
for i, base_hash in enumerate(base_hashes):
    index.add(base_hash, i)
Contributor

nit: PDQIndex2.build()
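The nit suggests constructing the index through a build() classmethod rather than looping add() by hand at each call site. A toy sketch of that pattern is below; TinyIndex is illustrative only, not the real PDQIndex2:

```python
import typing as t


class TinyIndex:
    """Toy stand-in illustrating the build() construction pattern."""

    def __init__(self) -> None:
        self._entries: t.List[t.Tuple[str, int]] = []

    def add(self, pdq_hash: str, payload: int) -> None:
        self._entries.append((pdq_hash, payload))

    @classmethod
    def build(cls, entries: t.Iterable[t.Tuple[str, int]]) -> "TinyIndex":
        """Construct a populated index from (hash, payload) pairs in one call."""
        idx = cls()
        for pdq_hash, payload in entries:
            idx.add(pdq_hash, payload)
        return idx
```

Centralizing construction this way keeps test setup to a single expression and leaves room to optimize bulk insertion later without touching callers.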

Comment on lines 20 to 23
from threatexchange.signal_type.pdq.pdq_utils import (
    BITS_IN_PDQ,
    convert_pdq_strings_to_ndarray,
)
Contributor
👍

@haianhng31 haianhng31 requested a review from Dcallies November 20, 2024 21:20
@Dcallies (Contributor) left a comment
So close!

  1. The way you are seeding your hashes is causing duplicates in your query set - see the inline blocking comment.
  2. As part of your test plan (it doesn't have to be part of the test itself), please try generating a very large number of hashes and queries - e.g. 10,000 banked and 100,000 queries - something that takes 5-10 seconds to check. We want to be certain that the index is correct!
  3. Please add a test that serializes and deserializes a non-empty index - you can do something cheeky like:
original = ...  # do stuff to prepare the index
deserialized = PDQIndex2.deserialize(original.serialize(...))
for index in (original, deserialized):
    # The original test goes here; should get the same results on original and deserialized
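Since the PR summary says serialization is pickle-based, the round-trip property that this test would check can be sketched with stdlib-only code (the function names below are illustrative, not the PR's API):

```python
import pickle


def serialize_entries(entries):
    """Persist (hash, payload) entries to bytes via pickle."""
    return pickle.dumps(entries)


def deserialize_entries(data):
    """Load entries back; a round trip must preserve them exactly."""
    return pickle.loads(data)
```

The point of running the same assertions against both the original and the deserialized index is that it catches state that pickle silently drops or mangles.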

Comment on lines 10 to 15
def _generate_sample_hashes(size: int, seed: int = 42):
    random.seed(seed)
    return [PdqSignal.get_random_signal() for _ in range(size)]


SAMPLE_HASHES = _generate_sample_hashes(100)
Contributor

This works for me!

from threatexchange.signal_type.pdq.pdq_utils import simple_distance


def _generate_sample_hashes(size: int, seed: int = 42):
Contributor

42

[image: cover of The Hitchhiker's Guide to the Galaxy]

@haianhng31 (Contributor Author) commented Nov 21, 2024

I'm sorry, what do you mean by the image? :P

Contributor

42 is the "Answer to the Ultimate Question of Life, The Universe, and Everything" in the book series "The Hitchhiker's Guide to the Galaxy", of which the image is on the cover. I thought you were a fan :P

Read more: https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy

@haianhng31 (Contributor Author)

@Dcallies
I have tried with 1,000 banked hashes & 10,000 queries in the unit test. The test passed, but it took 4 minutes to run. And when I tried 10,000 banked hashes & 100,000 queries, the test took far too long to finish.

@haianhng31 haianhng31 requested a review from Dcallies November 21, 2024 23:08
@Dcallies (Contributor) left a comment

All blocking comments resolved!

@Dcallies Dcallies merged commit 2ae7f50 into facebook:main Nov 25, 2024
6 checks passed
3 participants