Add support for unlogged tables (#125) #196

juuu-jiii · 2023-10-10T16:28:02Z

This PR adds support for unlogged tables, ensuring that Lantern functionality works correctly even when the table is not written to the WAL. A corresponding test (and its associated utility scripts) has been added to make test to verify that unlogged tables behave as expected.

NOTE: Currently the test checks inserts and index creation on an empty unlogged table as well as one retrieved from file. It also verifies that the created index works and that distance functions work on unlogged tables. Let me know if there are other cases that should be covered, and I'll be happy to add them!

Ngalstyan4

Great first PR! Well done, Choi (@juuu-jiii)!

Ngalstyan4 · 2023-10-13T05:57:54Z

src/hnsw/build.c

+                numBlocks = RelationGetNumberOfBlocks(heap);
+                break;
+            default:  // should be case INIT_FORKNUM, but set to default to keep compiler quiet
+                numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);


does this case not work for the main forknum as well?

No -- RelationGetNumberOfBlocks is a preprocessor macro that calls RelationGetNumberOfBlocksInFork with MAIN_FORKNUM as the second argument, so I found it necessary to call RelationGetNumberOfBlocksInFork directly and pass in forkNum myself (which holds INIT_FORKNUM in this case). In hindsight, I could have hardcoded the second argument to INIT_FORKNUM to make my intention more explicit (so that I'd call RelationGetNumberOfBlocksInFork(index, INIT_FORKNUM) instead). Would that be better?

So, the whole switch statement could just be replaced with numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);, no?

Ah, you're right! That's much, much cleaner. I'll have the fixes made and changes pushed!

Hold on a minute, forget what I said earlier. The switch statement can't be replaced with numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum); -- when forkNum == MAIN_FORKNUM, the relation we pass in is heap, but when forkNum == INIT_FORKNUM, the relation we pass in is index instead (heap is NULL in that case).

In other words, when forkNum == MAIN_FORKNUM, we run the equivalent of numBlocks = RelationGetNumberOfBlocksInFork(heap, forkNum);
When forkNum == INIT_FORKNUM, we run numBlocks = RelationGetNumberOfBlocksInFork(index, forkNum);
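The point made in this thread, that the relation to query depends on the fork, can be illustrated with a small standalone sketch. It does not use the PostgreSQL headers: MockRelation, mock_blocks_in_fork and estimate_num_blocks are hypothetical stand-ins invented for illustration, and only the fork constants mirror PostgreSQL's ForkNumber values. It is a sketch of the selection logic under those assumptions, not the code in src/hnsw/build.c.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles without PostgreSQL headers.
 * The fork numbers mirror PostgreSQL's ForkNumber enum. */
typedef enum { MAIN_FORKNUM = 0, INIT_FORKNUM = 3 } ForkNumber;
typedef unsigned int BlockNumber;

/* Hypothetical relation: just a block count per fork. */
typedef struct MockRelation {
    BlockNumber blocks[4];
} MockRelation;

/* Hypothetical stand-in for RelationGetNumberOfBlocksInFork. */
BlockNumber mock_blocks_in_fork(const MockRelation *rel, ForkNumber forkNum)
{
    return rel->blocks[forkNum];
}

/* Main fork: size the build from the heap.
 * Init fork: the heap is NULL, so ask the index relation instead. */
BlockNumber estimate_num_blocks(const MockRelation *heap,
                                const MockRelation *index,
                                ForkNumber forkNum)
{
    const MockRelation *source = (forkNum == MAIN_FORKNUM) ? heap : index;

    assert(source != NULL); /* never dereference a missing relation */
    return mock_blocks_in_fork(source, forkNum);
}
```

Because both the relation and the fork number change together, a single call with a fixed relation argument cannot cover both cases, which is why the switch (or an equivalent conditional) is still needed.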

Ngalstyan4 · 2023-10-13T06:00:32Z

src/hnsw/build.c

-        LanternBench("build hnsw index", ScanTable(buildstate));
+
+        if(buildstate->heap != NULL) {
+            LanternBench("build hnsw index", ScanTable(buildstate));


Does this not run for unlogged tables?
If it does not run, how is the index structure populated?

I don't think ScanTable can run when building the init fork for an unlogged table, because it calls table_index_build_scan, which accesses data in buildstate->heap (buildstate->heap is NULL in that case, so accessing it would dereference a NULL pointer and crash the backend). Instead, I populated the IndexInfo struct by calling BuildIndexInfo at the beginning of ldb_ambuildunlogged.

ezra-varady · 2023-10-18T05:52:31Z

Great PR. Thanks for working on this issue. There are a few things I think might need to be modified.

I notice that in one of your tests the index initialized in the init fork is non-empty. This appears to be because it's initialized from an external index file. Since the underlying table is unlogged, a database crash will leave the table empty while the index on it remains populated, which is not the expected behavior. More generally, ambuildempty should probably always build an empty index.

It's probably worth factoring out the code to build an empty index instead of calling BuildIndex. BuildIndex is doing work that isn't strictly necessary, e.g. initializing a usearch index, and modifying the function to support this code path adds some complexity. At a minimum I would probably call StoreExternalIndex instead, though you can probably just copy some of the code from it, since it also requires constructing a usearch index. Thinking more about this, you may also need to initialize a blockmapgroup, which isn't happening right now.

It would also be nice to test that when the database crashes, the index in the init fork is loaded and can have vectors inserted into it. There isn't a great way to induce this within postgres at the moment. One option would be to add a function that does what pg_terminate_backend does but sends SIGKILL instead of SIGTERM. This function should probably not get compiled into release builds, since crashing the backend is very much not desirable behavior. It may not be worth doing all this, but it would be a good sanity check to at least kill -9 the backend and make sure things work as expected.

Finally, it seems like we shouldn't be writing to the WAL if an index's underlying table is unlogged. I'm not sure what would happen in a crash. I think the fork would just get overwritten with the init fork, so it wouldn't cause issues, but since it's not expected it's probably best to avoid. It would also marginally improve performance for this use case, which is nice.
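The SIGKILL helper suggested above could look roughly like the following standalone POSIX sketch. It is not a PostgreSQL extension function: crash_process_for_testing, spawn_and_crash_child and the LDB_CRASH_TESTING flag are hypothetical names chosen for illustration, and a real version would still have to be wrapped as an SQL-callable function and kept out of release builds by the build system.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical build flag: a real build would enable it only for test
 * builds so the helper never ships in a release. */
#ifndef LDB_CRASH_TESTING
#define LDB_CRASH_TESTING 1
#endif

#if LDB_CRASH_TESTING
/* Send SIGKILL (not SIGTERM) to one process.
 * Returns 0 on success, -1 on failure. */
int crash_process_for_testing(pid_t pid)
{
    if (pid <= 0)
        return -1; /* refuse process-group and broadcast targets */
    return kill(pid, SIGKILL);
}

/* Usage demo: fork a child, kill it, and report the signal that ended it.
 * Returns the terminating signal number, 0 if the child exited normally,
 * or -1 on error. */
int spawn_and_crash_child(void)
{
    int status = 0;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        for (;;)
            pause(); /* child: sleep until a signal arrives */
    }
    if (crash_process_for_testing(child) != 0)
        return -1;
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
#endif
```

SIGKILL cannot be caught or ignored, so the target gets no chance to shut down cleanly; that is the property that makes it useful for simulating a crash, as opposed to the orderly exit triggered by SIGTERM.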

juuu-jiii · 2023-10-23T03:43:55Z

@ezra-varady Thank you for the feedback! I'll factor the unlogged index code out of BuildIndex and place it into its own function, and I'll also look into testing index loading after database crashes.

Re: WAL, nothing is saved there if we are dealing with an unlogged table. Postgres performs checks under the hood to see whether the table in question is unlogged, and if it is, it skips xlog-related logic even if a WAL command is issued. I discovered this when looking at the documentation as well as at how other implementations of the aminsert index access method within the Postgres code deal with unlogged tables.

Ngalstyan4 · 2024-01-13T23:58:46Z

Addressed by #253

juuu-jiii added 10 commits October 13, 2023 00:08

Add logic to build empty hnsw index in init fork

344874a

Verify that unlogged tables are not saved to WAL

afae0a6

Create util testing scripts using unlogged tables

2baffc2

Add test for unlogged table support

12fffa4

Add expected output for unlogged table test

8d76aa5

Add unlogged table test to test/schedule.txt

4f754d2

Consolidate index building into a single function

fc6707c

Update expected test output for unlogged tables

3d088b2

Revert accidental usearch changes

b2ad06e

Format hnsw/build.c and hnsw/insert.c

a117e54

Ngalstyan4 reviewed Oct 13, 2023

juuu-jiii force-pushed the main branch from 226ecd6 to a117e54 October 13, 2023 07:14

Ngalstyan4 closed this Jan 13, 2024

juuu-jiii commented Oct 10, 2023

Ngalstyan4 left a comment

Ngalstyan4 Oct 13, 2023

juuu-jiii Oct 13, 2023

Ngalstyan4 Oct 14, 2023

juuu-jiii Oct 16, 2023

juuu-jiii Oct 16, 2023

Ngalstyan4 Oct 13, 2023

juuu-jiii Oct 13, 2023

ezra-varady commented Oct 18, 2023

juuu-jiii commented Oct 23, 2023

Ngalstyan4 commented Jan 13, 2024
