Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieval metrics are misleading - here's how it should be done instead #2611

Open
devinbost opened this issue Jun 26, 2024 · 6 comments
Open
Labels
enhancement New feature or request
Milestone

Comments

@devinbost
Copy link

devinbost commented Jun 26, 2024

🚀 Request

Retrieval metrics should be more aligned with typical practices. Recommendation is below.

Explanation

The current best practice for calculating retrieval metrics follows this process:

  1. Calculate ground truth search labels. In the case of vector search, this would be via exact KNN search.
  2. Calculate predicted results. In the case of vector search, this would be retrieving ANN search results.
  3. Calculate score. The way a predicted result is determined to be relevant is determined by whether it appears in the ground truth list. Some implementations also consider the position.

Example (vector search recall)

In the case of top k recall for vector search:

  1. obtain the top k exact KNN results for the given query
  2. obtain the top k ANN results for the given query
  3. Calculate the percentage of ANN results that appear in the KNN result list
  4. Repeat for every query in the test set

Current implementation

The current implementation in torchmetrics expects indexes (corresponding to queries), predictions (probabilities used to rank, which could be similarity scores), targets (which are supposed to be ground truth labels).

What's misleading:

The problem is that the current implementation assumes that a 1:1 mapping exists between predictions and ground truth labels, which does not align with the industry practice.

How it should work instead

The method signature should look more like this:
query/index (tensor), true_preds (tensor), true_targets (tensor), actual_preds (tensor), actual_targets (tensor), threshold (optional float), epsilon (optional float), similarity/distance (bool, defaults to distance)

With this data, the score calculation could be based on either:

  1. count. If the preds are distinct, then top k could be obtained by:
    a. sorting (true_preds, true_targets) by true_preds and filtering to top k items,
    b. sorting (actual_preds, actual_targets) by actual_preds and filtering to top k items
    c. performing the computation between the lists for each given query/index

  2. threshold. If preds are not distinct, then we would take top values obtained by a threshold, t, obtained by:
    a. sorting (true_preds, true_targets) by true_preds and filtering to top items where true_preds are <= t (if t is a distance) or >= t (if t is a similarity)
    b. sorting (actual_preds, actual_targets) by actual_preds and filtering to top items where actual_preds are <= (1 + epsilon) * t (if t is a distance) or >= (1 - epsilon) * t (if t is a similarity) where epsilon is a modifier to soften the filter on the actual side.
    c. performing the computation between the lists for each given query/index

An implementation like this would enable torchmetrics to calculate Recall, MAP, AUROC, NDCG, MRR, etc., based on industry-accepted practices.

Additional context

The well-known ANN-Benchmarks paper goes into detail on the recall calculation used in the example here.

image
image

(Aumüller, M., Bernhardsson, E., & Faithfull, A. (2020). ANN-Benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87, 101374. Available at: https://arxiv.org/pdf/1807.05614.pdf )
 

@devinbost devinbost added the enhancement New feature or request label Jun 26, 2024
@lantiga
Copy link
Contributor

lantiga commented Jun 27, 2024

Thanks for the report @devinbost, I totally see where you're coming from.

@SkafteNicki let's prioritize this issue. What's your thought on providing the additional signature?

@SkafteNicki
Copy link
Member

Hi @devinbost, thanks for raising this issue.
Tagging @lucadiliello for opinions, because he has implemented most of the retrieval metrics.
I am more than willing to change the implementation based on best practice in the industry. Have only worked shortly on retrieval problems, so not a expert.

The only problem in my view we have to solve is the backwards compatibility. I am going to assume that some of our user base are using the retrieval metrics as they are right now, as this is only raised as a issue now (the alternative is that non are using them). Thus, I would preferably support both cases. @devinbost does the new propose interface do this, or can we discuss how to make things backwards compatible.

@lantiga
Copy link
Contributor

lantiga commented Jun 28, 2024

They are definitely used: https://github.com/search?q=RetrievalMRR&type=code

So we need to come up with a backwards compatible design. We could add a new submodule, or a new set of classes in the retrieval submodule. Given the different signatures I wouldn't complicate the existing classes, but that's me.

@SkafteNicki @lucadiliello @devinbost what would be a good name for the submodule that is not retrieval? Is search a good enough one?

@devinbost
Copy link
Author

I deeply appreciate all the attention on this!!

@lantiga search is a good name. The implementation that I recommended above would definitely apply in search use cases.

@SkafteNicki
Copy link
Member

@devinbost Is there any of the metrics from the retrieval domain you are particular interested in? Just to prioritize what to implement first

@devinbost
Copy link
Author

@SkafteNicki The first one would be recall. After that, MRR and then NDCG. Those the the most commonly used metrics in the industry and literature.

@SkafteNicki SkafteNicki added this to the v1.5.0 milestone Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants