Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Cost estimation #5

Open
VoVAllen opened this issue Sep 29, 2024 · 3 comments
Open

feat: Cost estimation #5

VoVAllen opened this issue Sep 29, 2024 · 3 comments
Assignees
Milestone

Comments

@VoVAllen
Copy link
Member

Ref: https://github.com/pgvector/pgvector/blob/158d9340bca5796d8f98182e5a65356bac676b74/src/ivfflat.c#L64

@VoVAllen VoVAllen added this to the v0.1.0 milestone Sep 29, 2024
@VoVAllen VoVAllen assigned cutecutecat and unassigned cutecutecat Nov 29, 2024
@cutecutecat cutecutecat self-assigned this Dec 9, 2024
@cutecutecat
Copy link
Member

cutecutecat commented Dec 9, 2024

We show the cost estimation and actual time cost by EXPLAIN ANALYZE at dataset openai-500k and its subset.
dim=1536

size(list=4096 prob=200) seq cost estimate seq actual time index cost estimate index actual time
500, 000 11307.36 2290.954 0.0 6.594
100, 000 3508.43 567.231 0.0 5.996
50, 000 1759.54 310.085 0.0 5.476
1, 000 ~53 3.765 0.0 ~2.1
10 (list=4 probe=2) ~53 0.163 0.0 ~2.1

As we can see, at size 10 seq scan is better than index scan, but at size 1000, index scan can overtake it. However, the cost estimation of seq scan is always the same: about 53 at this small data situation. So it is impossible to find an expression of index cost estimator which should be <53 at size 1000 and >53 at size 10.

The most acceptable way is to take both size 1000 and 10 as an index scan, which might sacrifice some QPS at size range from 1 to 1000. But to be honest, any formula meet this condition is sured to be equivalence to the current implementation.

As a conclusion, as the vchordrq index is much more efficient than ivf, we might not need a cost estimation like pgvector.

@VoVAllen
Copy link
Member Author

VoVAllen commented Dec 9, 2024

What's the first column "size"?

@cutecutecat
Copy link
Member

cutecutecat commented Dec 10, 2024

What's the first column "size"?

Number of vector rows for inserting to the database. We select the first 10, 1000, 50000 and 100000 rows from the whole 500000 openai database as a subset, to validate the cost estimator.

@VoVAllen VoVAllen reopened this Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants