
perf: implement XTR for retrieving multivector #3437

Open · wants to merge 12 commits into main
Conversation

BubbleCal (Contributor)

This PR introduces XTR, which can score documents without reading the original multivectors, so we don't need any IO for searching on a multivector column.

For documents missed by a single query vector, it uses that query vector's minimum retrieved similarity as the estimated similarity.
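In other words (my restatement of the two sentences above, not wording taken from the XTR paper): for query vectors $q_1, \dots, q_m$ and a document $d$,

$$
\mathrm{score}(d) \;\approx\; \sum_{i=1}^{m}
\begin{cases}
s_i(d) & \text{if } q_i \text{ retrieved } d,\\
\hat{s}_i & \text{otherwise,}
\end{cases}
$$

where $s_i(d)$ is the similarity the index returned for $q_i$ against $d$'s best-matching vector, and $\hat{s}_i$ is the minimum similarity in $q_i$'s result batch, used as the estimate for documents $q_i$ missed.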

Signed-off-by: BubbleCal <[email protected]>
@BubbleCal BubbleCal marked this pull request as ready for review February 8, 2025 08:10

codecov-commenter commented Feb 8, 2025

Codecov Report

Attention: Patch coverage is 85.31746% with 37 lines in your changes missing coverage. Please review.

Project coverage is 78.80%. Comparing base (cca98fc) to head (940fb85).

Files with missing lines Patch % Lines
rust/lance/src/io/exec/knn.rs 85.44% 24 Missing and 7 partials ⚠️
rust/lance/src/dataset/scanner.rs 79.31% 1 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3437      +/-   ##
==========================================
- Coverage   78.82%   78.80%   -0.03%     
==========================================
  Files         251      251              
  Lines       92866    93075     +209     
  Branches    92866    93075     +209     
==========================================
+ Hits        73202    73344     +142     
- Misses      16686    16751      +65     
- Partials     2978     2980       +2     
Flag Coverage Δ
unittests 78.80% <85.31%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown.


- .map(|v| distance_type.func()(q, v))
- .min_by(|a, b| a.partial_cmp(b).unwrap())
+ .map(|v| 1.0 - distance_type.func()(q, v))
+ .max_by(|a, b| a.total_cmp(b))
BubbleCal (Contributor, Author)
Changing these so that the flat search results match IVF_FLAT, so the numbers won't confuse users.
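For reference, a minimal sketch of the flat "maxsim"-style scoring this lines up with (my illustration, not the Lance code; `maxsim` and `dist` are made-up names): per query vector, convert distance to similarity with `1.0 - distance`, keep the best-matching document vector, then sum across query vectors.

```rust
/// Sketch only: score one multivector document against a multivector query.
/// `dist` stands in for whatever `distance_type.func()` returns.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>], dist: impl Fn(&[f32], &[f32]) -> f32) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|v| 1.0 - dist(q.as_slice(), v.as_slice())) // similarity, as in the new lines above
                .max_by(|a, b| a.total_cmp(b))                   // best-matching document vector
                .unwrap_or(0.0)                                  // empty document: contributes nothing
        })
        .sum()
}
```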

_ => unreachable!(),
};

- let mut knn_node = if q.refine_factor.is_some() || is_multivec {
+ let mut knn_node = if q.refine_factor.is_some() {
BubbleCal (Contributor, Author)

We no longer require refine for multivector search.

@@ -1061,7 +1063,7 @@ mod tests {
let gt = multivec_ground_truth(&vectors, &query, k, params.metric_type);
let gt_set = gt.iter().map(|r| r.1).collect::<HashSet<_>>();

- let recall = row_ids.intersection(&gt_set).count() as f32 / 10.0;
+ let recall = row_ids.intersection(&gt_set).count() as f32 / 100.0;
BubbleCal (Contributor, Author)
The recall was calculated incorrectly; the previous algorithm required refine_factor=5 to reach good enough recall.
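(For context, and as my reading of the diff rather than something stated here: recall is computed as |row_ids ∩ gt_set| / |gt_set|, so the new 100.0 denominator matches a ground-truth size of 100 in this test.)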

@wjones127 (Contributor) left a comment
Something seems off in the algorithm with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

Comment on lines +739 to +740
let row_ids = batch[ROW_ID].as_primitive::<UInt64Type>();
let dists = batch[DIST_COL].as_primitive::<Float32Type>();
wjones127 (Contributor)
Since we are using the values here, can we add a debug assert that there are no nulls?

BubbleCal (Contributor, Author)
added
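For the record, a minimal sketch of the kind of check this could be, using the names from the quoted snippet above (the exact assert added in the PR may differ):

```rust
// Sketch only: `batch`, `ROW_ID`, and `DIST_COL` come from the quoted snippet.
let row_ids = batch[ROW_ID].as_primitive::<UInt64Type>();
let dists = batch[DIST_COL].as_primitive::<Float32Type>();
debug_assert_eq!(row_ids.null_count(), 0, "row ids must not contain nulls");
debug_assert_eq!(dists.null_count(), 0, "distances must not contain nulls");
```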

// at most, we will have k * refine_factor results for each query
let mut results = HashMap::with_capacity(k * refactor);
let mut missed_similarities = 0.0;
while let Some((min_sim, batch)) = reduced_inputs.try_next().await? {
wjones127 (Contributor)

I don't understand the algorithm in the paper deeply, but it seems odd to me that the order of the ANN queries matters. It appears that later batches will be adding a higher missed_similarities value. Is that intentional?

It looks like the output order of select_all isn't deterministic. https://docs.rs/futures/latest/futures/stream/fn.select_all.html

BubbleCal (Contributor, Author)

It's intentional; this is a bit complicated, so I will add more comments about it.
Consider updating the final results with a batch from one query vector. For a row x:

  • if x exists in results but not in the batch: we use min_sim as the estimated similarity, so the contribution is min_sim
  • if x exists in both, the contribution is its sim in the batch
  • if x exists only in the batch, all earlier queries missed this row; the algorithm maintains missed_similarities as the sum of min_sim so far, so the contribution is missed_similarities + sim
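A minimal sketch of that update in my own words (not the PR's actual code), assuming each query vector's batch has already been collapsed into a row id → similarity map and `min_sim` is that batch's smallest similarity:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the per-batch update described above.
fn merge_batch(
    results: &mut HashMap<u64, f32>,
    missed_similarities: &mut f32,
    min_sim: f32,
    batch: &HashMap<u64, f32>,
) {
    // case 1: rows already in `results` that this query vector missed get the
    // estimated contribution `min_sim`
    for (row_id, score) in results.iter_mut() {
        if !batch.contains_key(row_id) {
            *score += min_sim;
        }
    }
    for (&row_id, &sim) in batch {
        results
            .entry(row_id)
            // case 2: row retrieved by this query vector and seen before: add the real similarity
            .and_modify(|score| *score += sim)
            // case 3: row seen for the first time: all earlier query vectors missed it,
            // so charge their accumulated estimates plus this similarity
            .or_insert(*missed_similarities + sim);
    }
    // remember this batch's estimate for rows that only show up later
    *missed_similarities += min_sim;
}
```

With this shape, every query vector contributes exactly once to every row, either its real similarity or its min_sim estimate, so the totals shouldn't depend on the order the batches are folded in.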

wjones127 (Contributor)
Ah I see. It makes more sense now that I know what part was missing.

Signed-off-by: BubbleCal <[email protected]>
@BubbleCal (Contributor, Author)

> Something seems off in the algorithm with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?

we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

@BubbleCal BubbleCal requested a review from wjones127 February 11, 2025 07:35
Signed-off-by: BubbleCal <[email protected]>
@wjones127 (Contributor)

> Something seems off in the algorithm with how missed_similarities is handled. Could you address my comment, and also maybe write a unit test that shows we get correct results out of this?
>
> we have tests here https://github.com/lancedb/lance/pull/3437/files#diff-6de816b72e7c722316243c57df4f809ad34dc8581367c72335154dada48c40edL993

I meant more a test showing the XTR algorithm itself works as expected. Part of why I'm having a hard time understanding this PR is that there are no tests showing the expected behavior of the algorithm.

Comment on lines +792 to +796
// for a row `r`:
// if `r` is in only `results`, then `results[r] += min_sim`
// if `r` is in only `query_results`, then `results[r] = query_results[r] + missed_similarities`,
// here `missed_similarities` is the sum of `min_sim` from previous iterations
// if `r` is in both, then `results[r] += query_results[r]`
wjones127 (Contributor)

I'm still having trouble understanding the logic.

So let's imagine batch 1 is:

  • rowid: 1, distance: 2.0
  • rowid: 2, distance 3.0

And batch 2 is:

  • rowid: 2, distance: 1.0
  • rowid: 3, distance: 0.5

If batch 1 comes in first, then we should get:

  • rowid 1, distance = 2.0 + 2.0 = 4.0 (distance in batch 1, plus min sim from batch 1)
  • rowid 2, distance = 3.0 + 1.0 = 4.0 (sum of distances across batches)
  • rowid 3, distance = 0.0 + 0.5 = 0.5 (min distance plus 0.5)

But if batch 2 comes in first, then we will get:

  • rowid 1, distance = 2.0 + 0.5 = 2.5 (min distance of batch 2, plus distance in batch 1)
  • rowid 2, distance = 1.0 + 3.0 = 4.0
  • rowid 3, distance = 0.5 + 0.5 = 1.0

rowid 1 and rowid 3 seem to get completely different distances depending on the order that the queries finish. And there doesn't seem to be any definition of what queries finish first.

BubbleCal (Contributor, Author)

min_sim is from the current batch; missed_similarities is the sum of the min_sim of all batches before it.

so if batch 1 comes first:

  • rowid 1, sim = 2.0 + 0.5 = 2.5 (sim in batch 1 + min_sim in batch 2)
  • rowid 2, sim = 3.0 + 1.0 = 4.0 (sum of sim)
  • rowid 3, sim = 2.0 + 0.5 = 2.5 (missed_sim_sum (min_sim in batch 1) + sim in batch 2)

if batch 2 comes first:

  • rowid 1, sim = 0.5 + 2.0 = 2.5 (missed_sim_sum (min_sim in batch 2) + sim in batch 1)
  • rowid 2, sim = 3.0 + 1.0 = 4.0
  • rowid 3, sim = 0.5 + 2.0 = 2.5 (sim in batch 2 + min_sim in batch 1)

Yeah, it's quite complicated... I should add some tests to verify that the scores are the same in any order.
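For instance, reusing the hypothetical `merge_batch` sketch from earlier in this thread (my illustration, not the PR's test), folding the two example batches in either order gives the same totals:

```rust
use std::collections::HashMap;

#[test]
fn merge_batch_is_order_independent() {
    let batch1: HashMap<u64, f32> = HashMap::from([(1, 2.0), (2, 3.0)]); // min_sim = 2.0
    let batch2: HashMap<u64, f32> = HashMap::from([(2, 1.0), (3, 0.5)]); // min_sim = 0.5

    // batch 1 first
    let (mut a, mut missed_a) = (HashMap::new(), 0.0);
    merge_batch(&mut a, &mut missed_a, 2.0, &batch1);
    merge_batch(&mut a, &mut missed_a, 0.5, &batch2);

    // batch 2 first
    let (mut b, mut missed_b) = (HashMap::new(), 0.0);
    merge_batch(&mut b, &mut missed_b, 0.5, &batch2);
    merge_batch(&mut b, &mut missed_b, 2.0, &batch1);

    // rowid 1 -> 2.5, rowid 2 -> 4.0, rowid 3 -> 2.5 either way
    assert_eq!(a, b);
}
```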


@BubbleCal BubbleCal requested a review from wjones127 February 19, 2025 06:33
Signed-off-by: BubbleCal <[email protected]>
Signed-off-by: BubbleCal <[email protected]>
.collect::<Vec<_>>();

let mut res: Option<HashMap<_, _>> = None;
for perm in batches.into_iter().permutations(3) {
BubbleCal (Contributor, Author), Feb 19, 2025

I'm not sure whether select_all yields items in stream order when all of them are ready, but checking all 6 permutations should be enough to verify this even if it yields them in a random order.
