-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for unlogged tables (#125) #196
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great first PR! Well done, Choi (@juuu-jiii) !
numBlocks = RelationGetNumberOfBlocks(heap); | ||
break; | ||
default: // should be case INIT_FORKNUM, but set to default to keep compiler quiet | ||
numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this case not work for the main forknum as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- RelationGetNumberOfBlocks is a preprocessor macro that calls RelationGetNumberOfBlocksInFork and passes in MAIN_FORKNUM as the second argument, so I found it necessary to call RelationGetNumberOfBlocks and manually pass in forkNum (which would store INIT_FORKNUM in this case). In hindsight, I could have just hardcoded the second argument to INIT_FORKNUM as well to make my intention more explicit (so that I'd call RelationGetNumberOfBlocksInFork(index, forkNum) instead). Would that be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the whole switch
statement could just be replaced with numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right! That's much, much cleaner. I'll have the fixes made and changes pushed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hold on a minute, forget what I said earlier. The switch statement can't be replaced with numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);
-- when forkNum == MAIN_FORKNUM
, the relation we pass in is heap
, but when forkNum == INIT_FORKNUM
, the relation we pass in instead is index
(heap
references NULL).
In other words, when forkNum == MAIN_FORKNUM, we run the equivalent of numBlocks = RelationGetNumberOfBlocksInFork(heap, forkNum);
When forkNum == INIT_FORKNUM
, we run numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);
LanternBench("build hnsw index", ScanTable(buildstate)); | ||
|
||
if(buildstate->heap != NULL) { | ||
LanternBench("build hnsw index", ScanTable(buildstate)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this not run for unlogged tables?
If it does not run, how is the index structure populated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ScanTable can run for unlogged tables because it calls table_index_build_scan, which accesses data in buildstate->heap (buildstate->heap references NULL in the case of unlogged tables, leading to a runtime exception when accessed). Instead, I populated the IndexInfo struct by calling BuildIndexInfo at the beginning of ldb_ambuildunlogged.
Great PR. Thanks for working on this issue. There are a few things I think might need to be modified. I notice in one of your tests the index initialized in the init fork is non-empty. This appears to be because it's initialized from an external index file. Since the underlying table is unlogged in the event that the database crashes this will result in the table being empty but the index on it remaining populated which is not the expected behavior. More generally It's probably worth factoring out the code to build an empty index instead of calling It would also be nice to test that when the database crashes, the index in the init fork is loaded and can have vectors inserted into it. There isn't a great way to induce this within postgres at the moment. Maybe adding a function that does what Finally, it seems like we shouldn't be writing to the WAL if an index's underlying table in unlogged. I'm not sure what would happen in a crash. I think the fork would just get overwritten with the init fork so it wouldn't cause issues, but since it's not expected it's probably best to avoid. It would also marginally improve performance for this use case which is nice |
@ezra-varady Thank you for the feedback! I'll factor the unlogged index code out of Re:WAL, nothing is saved there if we are dealing with an unlogged table. Postgres performs checks under the hood to see whether the table in question is unlogged, and if it is, it skips xlog-related logic even if a WAL command is issued. I discovered this when looking at documentation as well as how other implementations of the |
Addressed by #253 |
This PR adds support for unlogged tables, ensuring that Lantern functionality works correctly even if the current table is not saved to WAL. A corresponding test (and associated util testing scripts) have been added to make test to verify that unlogged table functionality behaves as expected.
NOTE: Currently the test checks inserts and index creation on an empty unlogged table as well as one retrieved from file. It also verifies that the created index works and that distance functions work on unlogged tables. Let me know if there are other cases that should be covered, and I'll be happy to add them!