Idea behind xt, xb, xq data split #1769

ljstrnadiii · 2021-03-18T20:06:19Z


Hello,

Quick question: what is the idea behind xt (training set) vs xb (base set or database set)?

In ML, there is no 'added' set like xb, or at least it is not obvious, so this is new to me. What is the general philosophy behind splitting a dataset into query, base and training dataset? Is training a subset of base? If so, is it just a subset that sufficiently trains the model before adding the rest? If not, do you mind explaining the idea behind the split?

Thanks for any guidance! Cheers!


beauby · 2021-03-18T20:25:46Z


Basically, you want to index a set of vectors (xb) that you will then search with a set of queries (xq). The xt set should be large enough for the index to train in a meaningful way, and it should have the same distribution as xb. xt and xb do not need to be disjoint.

4 replies

ljstrnadiii Mar 19, 2021
Author

Great, that is what I figured. Thanks for the clarification!

Are there any references at your fingertips for constructing a minimal training set that preserves the characteristics and distribution of the db set when building IVF indexes? No worries if not. It seems like statistics can help us here. I just figured this would be very helpful with sets of 100M+ vectors.

wickedfoo Mar 20, 2021
Collaborator

Random sampling of your xb set should be fine. Training for many Faiss indices typically means learning the IVF centroids or the PQ centroids, in which case roughly 30 or more training vectors per IVF centroid should be sufficient (e.g., if you have an IndexIVFPQ with nlist = 10000, then an xt of more than 300,000 vectors randomly sampled from xb should be OK).
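The rule of thumb above is simple arithmetic, sketched here with the standard library only. The helper names `min_training_vectors` and `sample_training_ids` are hypothetical, and the factor of 30 is the rough guideline from this reply, not a hard limit.

```python
import random

def min_training_vectors(nlist, per_centroid=30):
    """Rough lower bound on the training set size for nlist IVF centroids."""
    return nlist * per_centroid

def sample_training_ids(n_base, nlist, per_centroid=30, seed=0):
    """Pick row ids of xb, uniformly at random without replacement, to form xt."""
    n_train = min(n_base, min_training_vectors(nlist, per_centroid))
    return sorted(random.Random(seed).sample(range(n_base), n_train))

print(min_training_vectors(10000))  # 300000
```

The returned ids can then be used to select the corresponding rows of xb as the training set.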

wickedfoo Mar 20, 2021
Collaborator

In fact you can pass xb itself, even if massive, as the training set xt and the indices internally will sub-sample it.

It is a little more complicated if you call add() multiple times with several xb batches. In that case, you should draw the training subsample from across all the data that you will eventually add, not just from the first batch.
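One way to subsample across several batches without holding them all in memory is reservoir sampling. This is a generic technique swapped in for illustration, not something Faiss does for you; the function name `reservoir_sample` and the toy batches are hypothetical.

```python
import random

def reservoir_sample(batches, k, seed=0):
    """Uniformly sample k items from a stream of batches (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for batch in batches:
        for vec in batch:
            seen += 1
            if len(reservoir) < k:
                # Fill the reservoir with the first k items.
                reservoir.append(vec)
            else:
                # Keep the new item with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = vec
    return reservoir

# Toy stand-ins for two xb batches; real batches would hold vectors.
batches = [list(range(0, 100)), list(range(100, 250))]
xt_items = reservoir_sample(batches, k=20)
```

Because the index has to be trained before anything is added, this implies one pass over the data to collect xt, followed by a second pass that calls add() on each batch.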

ljstrnadiii Mar 20, 2021
Author

Makes sense to me. Basically, we need to make sure xt has the same distribution we anticipate from xb.

Thanks a ton!
