Postpone index build for empty table to first insert #209
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #198
In the event of an index being built on an empty table where no dimension is specified in the options (using
CREATE INDEX... WITH (dim=k)
), we postpone the creation of this index to the first insertion, where we can get the dimension of the inserted vector and then build the index. This improves user UX as outlined in the issue.I updated tests, and expanded on the
hnsw_create
test to test the above functionality.Notes/Observations:
0. One thing to consider is what might happen when we have concurrent inserts/calls to aminsert when the index build is postponed. We don't want to trigger the index build more than once accidentally. Perhaps we should pass a Boolean state that marks the progress of a postponed index build and check against it? I have yet to look into Postgres' concurrency model. Feel free to share ideas.
The
BuildCallback
immediately returns after the first tuple is read (explained in the comment there). This was done for simplicity's sake. But this also means that postgres is pointlessly calling this callback for every other tuple in the heap that exists at that time, since each callback after the first one will immediately return. So a future optimization may be, on postponed builds, to only read the first row and write it to the index without performing the full heap scan with the callback. However, note that this is only relevant in a specific scenario: when we construct an index on an empty table, specify no dimension, and do a bulk insert of some sort as part of the first insert (like\COPY
on acsv
file), which is when the heap would have more than one tuple at the time of building the postponed index. Since this scenario is pretty rare, we already have a simple implementation, and we are not guaranteed significant performance boosts if we opt for the alternative, I think that the current approach is well worth it.I noticed that the original codebase, when inferring a dimension (via the
GetArrayLengthFromHeap
method), we ditch inference entirely upon encountering aNULL
vector. This means that we're not able to inference when the first encountered tuple isNULL
but there are other non-NULL tuples. Wanted to point this out-- I don't think it would be a hard change: we can just keep going until we find a non-NULL vector or reach the end.I believe some of the code in
GetHnswIndexDimensions
can be refactored. It basically tries to do the same thing asInferDimension
, and both useGetArrayLength
. Idea is to haveGetHnswIndexDimensions
just callInferDimension
and then refactor theInitBuildState
as well, since it is currently calling both of these methods under certain circumstances which leads to the same operation effectively being done twice. But the way thatGetHnswIndexDimensions
andInferDimension
useGetArraylength
is also slightly different. Is the relevant part ofGetHnswIndexDimensions
andInferDimension
trying to accomplish the same thing? Curious to hear thoughts on this.