Will there be a difference in search speed between the daily and monthly partitions? #4840

rky0930 · 2021-03-23T05:28:56Z

Will there be a difference in search speed between the daily and monthly partitions?

I think that with daily partitions, each day ends with some data that has not reached index_file_size. No index is created for that data, so I expect one segment without an index to be left over every day.

If I keep data for 90 days, there will be 90 segments without indexes with daily partitions, versus 3 with monthly partitions.

The data is constantly growing: 1.5M vectors of dimension 1536 will be added per day.

Will the choice between daily and monthly partitions have a big performance impact?
If the difference is not big, I would like to use daily partitions.

yhmo · 2021-03-23T07:03:22Z

yhmo
Mar 23, 2021
Collaborator

1.5M vectors, dimension=1536
For each daily partition, data size = 1536 * 4 bytes * 1.5M ≈ 9.2G.
For each monthly partition, data size = 30 * 9.2G ≈ 276G.
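
The sizing arithmetic can be reproduced with a short sketch. It assumes 4-byte float32 components and decimal gigabytes (1 GB = 10^9 bytes), and it counts raw vector data only. The function name is illustrative and not part of Milvus.

```python
BYTES_PER_COMPONENT = 4  # assumption: vectors are stored as float32


def raw_size_gb(num_vectors: int, dim: int) -> float:
    """Raw vector data size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * BYTES_PER_COMPONENT / 1e9


daily_gb = raw_size_gb(1_500_000, 1536)  # one daily partition
monthly_gb = 30 * daily_gb               # one 30-day monthly partition

print(f"daily: {daily_gb:.3f} GB, monthly: {monthly_gb:.2f} GB")
```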

Assume all indexes are created successfully.
Assume the cache size is larger than the monthly partition's index file size.
Assume you have enough CPU cores for parallel computing.
Query performance will then depend mainly on these factors:

how many segments there are (depends on index_file_size)
index parameters (for IVF, nlist)
search parameters (for IVF, nprobe)

For example:
index_file_size = 4GB: each daily partition has 3 segments, and each monthly partition has about 70 segments. Each segment contains 4GB / 1536 / 4 ≈ 650000 vectors.
nlist=10000: each segment has 10000 central vectors and 10000 buckets, and each bucket contains about 65 vectors.
nprobe=100: each query searches 100 buckets in each segment.
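
As a rough check of these example figures, the segment counts and bucket size can be derived from the partition sizes. This is only a sketch: it assumes float32 vectors, decimal units (4GB = 4 * 10^9 bytes), and that a partition is split into segments of exactly index_file_size with the remainder forming one extra segment. The helper names are hypothetical.

```python
import math

DIM = 1536
BYTES_PER_VECTOR = DIM * 4           # assumption: float32 components
INDEX_FILE_SIZE = 4_000_000_000      # 4GB, decimal bytes
NLIST = 10_000


def segment_count(num_vectors: int) -> int:
    """Segments needed for a partition, rounding the tail up to one segment."""
    return math.ceil(num_vectors * BYTES_PER_VECTOR / INDEX_FILE_SIZE)


def vectors_per_full_segment() -> int:
    """How many vectors fit into one full segment."""
    return INDEX_FILE_SIZE // BYTES_PER_VECTOR


daily_segments = segment_count(1_500_000)
monthly_segments = segment_count(30 * 1_500_000)
avg_bucket_size = vectors_per_full_segment() / NLIST

print(daily_segments, monthly_segments, round(avg_bucket_size))
```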

Let's call one distance calculation between two vectors a compute unit. For an IVF index, the search engine first calculates the distance between the target vector and each central vector, then takes the top 100 (nprobe) buckets and calculates the distance between the target vector and each vector in those buckets. So, for each segment, the compute workload is nlist + nprobe * bucket_vector_count.

Now let's estimate the compute workload for daily and monthly partitions:
For a daily partition, workload = segment_count * (nlist + nprobe * bucket_vector_count) = 3 * (10000 + 100 * 65) = 49500 units
For a monthly partition, workload = segment_count * (nlist + nprobe * bucket_vector_count) = 70 * (10000 + 100 * 65) = 1155000 units

You plan to keep either 90 daily partitions or 3 monthly partitions.
(1) Searching the daily partitions, the total workload is: 90 * 49500 = 4455000 units
(2) Searching the monthly partitions, the total workload is: 3 * 1155000 = 3465000 units

The compute workload of (2) is less than that of (1), so (2) should perform better than (1).
But this is only a rough estimate; you need to run your own tests to verify it.
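
The estimate can be written as a small script so that the inputs are easy to change. This is a sketch of the same back-of-the-envelope model, not a benchmark: it assumes every segment is indexed and ignores caching, parallelism, and result merging. The function name is illustrative.

```python
def ivf_segment_workload(nlist: int, nprobe: int, bucket_vector_count: int) -> int:
    """Compute units per indexed segment: compare the query against every
    centroid, then against every vector in the nprobe closest buckets."""
    return nlist + nprobe * bucket_vector_count


per_segment = ivf_segment_workload(nlist=10_000, nprobe=100, bucket_vector_count=65)

daily_partition = 3 * per_segment      # 3 segments per daily partition
monthly_partition = 70 * per_segment   # 70 segments per monthly partition

total_daily = 90 * daily_partition     # (1) 90 daily partitions
total_monthly = 3 * monthly_partition  # (2) 3 monthly partitions

print(f"(1) daily:   {total_daily} units")
print(f"(2) monthly: {total_monthly} units")
```

In this model the gap comes entirely from the segment count: 90 * 3 = 270 segments for daily partitions versus 3 * 70 = 210 for monthly partitions, because each daily partition ends in a partly filled segment.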

3 replies

rky0930 Mar 23, 2021
Author

@yhmo Thank you for your kind answer !

May I ask one more question?
In the calculation above, you used an index_file_size of 4GB. Is that because 4GB is a good setting for this case?
I was going to use 256 MB or 512 MB, following these docs, with index type ivf_pq.

yhmo Mar 23, 2021
Collaborator

No, I used 4GB just to make the calculation easy.
The document gives general recommendations for two main scenarios: "streaming insert" (continual or frequent inserts) and "static search" (fixed data size, frequent searches). With streaming insert, the tail segment stays smaller than index_file_size for a long time, and if a user performs a search, the tail segment is searched by brute force. So we recommend an index_file_size of 256MB or 512MB, so that the tail segment can build its index earlier. But I think 256MB/512MB may give worse performance in some situations. You can calculate the workload of each query for your scenario and estimate the performance for index_file_size=256/512/1024/2048MB to decide on a proper value.
For a segment without an index, the compute workload is segment_vector_count units.
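
One way to run that comparison is to extend the per-segment model with a brute-force tail segment. The sketch below makes several assumptions that are not from the documentation: float32 vectors, decimal megabytes, nlist and nprobe held fixed across all sizes, and a tail segment that is never indexed (the worst case). The function name is hypothetical.

```python
def partition_workload(num_vectors: int, dim: int, index_file_size_mb: int,
                       nlist: int, nprobe: int) -> float:
    """Estimated compute units for one query against one partition."""
    bytes_per_vector = dim * 4  # assumption: float32 components
    capacity = index_file_size_mb * 1_000_000 // bytes_per_vector
    # Full segments get an IVF index; the remainder is the unindexed tail.
    full_segments, tail_vectors = divmod(num_vectors, capacity)
    bucket_vector_count = capacity / nlist
    indexed_units = full_segments * (nlist + nprobe * bucket_vector_count)
    # A segment without an index costs one unit per vector (brute force).
    return indexed_units + tail_vectors


for size_mb in (256, 512, 1024, 2048):
    units = partition_workload(1_500_000, 1536, size_mb, nlist=10_000, nprobe=100)
    print(f"index_file_size={size_mb}MB: about {round(units)} units per daily partition")
```

With nlist fixed, small segments pay the centroid comparison many times over, while large segments leave a bigger brute-force tail. In practice nlist would likely be tuned together with index_file_size, so treat the output as a starting point for measurement only.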

rky0930 Mar 23, 2021
Author

@yhmo okay. Thanks for answering!
