Integrate vfs2 #661

cole-miller · 2024-06-25T02:34:37Z

No description provided.

Signed-off-by: Cole Miller <[email protected]>

codecov · 2024-06-25T02:43:24Z

Codecov Report

Attention: Patch coverage is 71.08434% with 168 lines in your changes missing coverage. Please review.

Project coverage is 77.35%. Comparing base (6633bd8) to head (86ef5cd).
Report is 29 commits behind head on master.

Files	Patch %	Lines
src/fsm.c	56.25%	34 Missing and 15 partials ⚠️
src/server.c	61.34%	33 Missing and 13 partials ⚠️
src/vfs2.c	74.78%	10 Missing and 19 partials ⚠️
src/raft/replication.c	58.06%	7 Missing and 6 partials ⚠️
src/leader.c	61.90%	5 Missing and 3 partials ⚠️
src/raft/uv.c	77.27%	2 Missing and 3 partials ⚠️
test/raft/integration/test_uv_load.c	84.61%	4 Missing ⚠️
src/utils.h	57.14%	0 Missing and 3 partials ⚠️
src/raft/uv_metadata.c	75.00%	0 Missing and 2 partials ⚠️
src/raft/uv_segment.c	92.00%	0 Missing and 2 partials ⚠️
... and 5 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #661      +/-   ##
==========================================
- Coverage   77.41%   77.35%   -0.07%     
==========================================
  Files         196      196              
  Lines       27269    27377     +108     
  Branches     5455     5519      +64     
==========================================
+ Hits        21111    21178      +67     
+ Misses       4297     4286      -11     
- Partials     1861     1913      +52

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Cole Miller <[email protected]>

cole-miller · 2024-06-25T10:27:50Z

This PR does a few different things (sorry):

It removes --enable-dqlite-next from the build system and all mention of DQLITE_NEXT in the source code. All the code for the new disk-based configuration is now compiled in all build configurations, except for uses of new raft APIs that aren't available when linking against an old, external libraft.
It introduces new codepaths that use vfs2 instead of the old VFS, including calling functions that have no counterpart in the old setup, like vfs2_unapply. I've repurposed the existing dqlite_node_enable_disk_mode option to switch between the old in-memory VFS and vfs2; the existing on-disk configuration of the old VFS is no longer reachable after this PR.
It introduces a run-time configuration option for the raft_uv I/O implementation, raft_uv_set_format_version, that toggles between the backwards-compatible format for the persistent log and the new format (originally introduced and gated behind DQLITE_NEXT in Add local_data and is_local to raft_entry #639).
Includes various fixes for vfs2 that were added in the course of getting our test suite to pass when running in "disk mode". Note that the changes included in this initial pass are definitely not the final word---for example, as discussed elsewhere, vfs2_pseudo_read_begin needs to be rewritten and the way the WAL-index header is handled needs to be redone, also some of the fixes in this PR are more like band-aids and need to be redone more carefully.
Finally, reworks the initialization logic in server.c, moving things around so that we no longer leak resources if you create a dqlite_node and then destroy it without running it first (a longstanding issue, see comments in this file on master). I got sucked into doing this while working on the extra initialization required to set up vfs2, although it wasn't strictly necessary to carry it as far as I have done here.

letFunny

Did a first pass to half of the files approximately. The PR is looking very good, thanks!

I just have one general comment and that is that we need to make sure to remove all the usages of the disk config prior to this PR. For example: https://github.com/canonical/dqlite/blob/master/src/db.c#L50

letFunny · 2024-06-25T10:44:16Z

configure.ac

@@ -39,10 +39,6 @@ AM_CONDITIONAL(BUILD_SQLITE_ENABLED, test "x$enable_build_sqlite" = "xyes")
 AC_ARG_ENABLE(build-raft, AS_HELP_STRING([--enable-build-raft[=ARG]], [use the bundled raft sources instead of linking to libraft [default=no]]))
 AM_CONDITIONAL(BUILD_RAFT_ENABLED, test "x$enable_build_raft" = "xyes")

-AC_ARG_ENABLE(dqlite-next, AS_HELP_STRING([--enable-dqlite-next[=ARG]], [build with the experimental dqlite backend [default=no]]))
-AM_CONDITIONAL(DQLITE_NEXT_ENABLED, test "x$enable_dqlite_next" = "xyes")
-AS_IF([test "x$enable_build_raft" != "xyes" -a "x$enable_dqlite_next" = "xyes"], [AC_MSG_ERROR([dqlite-next requires bundled raft])], [])


If dqlite next needs to bundle raft, shouldn't we remove the option enable_build_raft as well and make it a default?

EDIT: we discussed this in the daily but now I think it makes more sense to do it.

I think we should do this for the next release but need to make sure first that our known downstreams are ready to move to bundled raft (or have already done so).

letFunny · 2024-07-01T08:36:14Z

src/utils.h

@@ -22,8 +24,20 @@
 #define POST(cond) assert((cond))
 #define ERGO(a, b) (!(a) || (b))

+#define UNHANDLED(expr) if (expr) assert(0)


Should this macro log something or is the stack trace going to be enough?

BTW thanks for this macro, I find code easier to read this way.

I introduced the macro to mark places where there's a potential failure that I didn't yet know how to handle, so at least the defective lines are easily greppable. It doesn't mean I think we should abort in all these cases (I should've explained this).

letFunny · 2024-07-01T09:14:31Z

src/fsm.c

+	/* TODO maybe vfs2 should just accept the pages and page numbers
+	 * in the layout that we receive them over the wire? */
+	dqlite_vfs_frame *frames = sqlite3_malloc((int)sizeof(*frames) * (int)cf->frames.n_pages);
+	UNHANDLED(frames == NULL);


We need to have a unified approach for out of memory errors, either crash or return an error code, but right now I see code doing both things.

Yes, generally we try to return an error code for that---anything marked with UNHANDLED here is intended to be temporary until proper error handling is introduced. (Writing it this way initially helped me develop the patch faster, but I can fix the UNHANDLED call sites in this PR before it merges.)

Can you add a TODO above the UNHANDLED definition macro then? Just so we remember to update it.

letFunny · 2024-07-01T09:17:18Z

src/fsm.c

+	/* TODO maybe vfs2 should just accept the pages and page numbers
+	 * in the layout that we receive them over the wire? */


This would still mean we would copy the data and not use the pointer from the command directly, right? So it would translate into removing the "translation loop" below from one data structure to the other, right?

It should be possible for vfs2_apply_uncommitted to parse the frames data in exactly the format that it comes over the wire, so no copy is needed at all. Maybe I should add that to the PR while I'm at it?

It can also be part of the next PR, but if the changes are minimal and we are avoiding a copy it seems worth it to me.

src/fsm.c

letFunny · 2024-07-01T09:37:08Z

src/fsm.c

-	struct fsm *f = raft_malloc(sizeof *f);
-
+	(void)config;
+	struct fsm *f = raft_malloc(sizeof(*f));
 	if (f == NULL) {
 		return DQLITE_NOMEM;


Example of what I meant in the previous comment regarding error handling on memory allocation.

letFunny · 2024-07-01T09:41:00Z

src/gateway.c

-	struct dqlite_node *node = g->raft->data;
-	pool_t *pool = !!(pool_ut_fallback()->flags & POOL_FOR_UT)
-		? pool_ut_fallback() : &node->pool;
-	pool_queue_work(pool, &req->work, g->leader->db->cookie,
-			WT_UNORD, qb_top, qb_bottom);


Why are we removing this?

The new calls into vfs2 in this PR all happen on the main thread, so in order to not have data races I had to undo the earlier change that moved sqlite3_step calls to the thread pool. The next PR will make everything async again.

letFunny · 2024-07-01T09:45:46Z

src/raft.h

+/**
+ */


Missing doc.

letFunny · 2024-07-01T10:43:00Z

src/raft/replication.c

-	rv = r->fsm->apply(r->fsm, buf, &result);
+
+	if (r->fsm->version >= 4 && r->fsm->apply2 != NULL) {
+		bool is_mine = is_local && term == r->current_term;


I think this should be part of the documentation of the apply2 function, stating what it means for an entry to be mine. Also, let's try to think about a better name that conveys that it has to have been created by the node while it was the leader (if we cannot, then the documentation is enough and we can keep it as is_mine). For example, what about created_as_leader_current_term? (maybe too verbose)

Agreed about documentation. Maybe a more verbose name in the header file and the concise is_mine when it's referred to in .c?

That would work.

letFunny · 2024-07-01T10:48:35Z

src/raft/uv.c

-	rv = uvMetadataLoad(uv->dir, &metadata, io->errmsg);
-	if (rv != 0) {
-		return rv;
-	}
-	uv->metadata = metadata;
-


Why are we moving this and the timer to _start and _load?

uvMetadataLoad checks the format_version field to detect malformed metadata files. The correct format version isn't known until load or bootstrap time, so I moved the metadata loading later on. But I'm definitely less than 100% certain that this is the right way to do it. Maybe it would be a better idea to just modify raft_uv_init to accept the desired format version as an additional argument?

Let me check my understanding here: we are using the disk_mode as a toggle for all of these changes (vfs2 + new format representation) which is why by default uv->format_version = 1 and we change it when calling dqlite_node_enable_disk_mode. And we need to support both implementations for at least some time.

If all of that is true then I guess it makes sense to keep it as is because the sequence has to be init -> enable_disk_mode -> load. Ideally we would pass the format as a parameter from the beginning so that we don't have to think about when the format changes and when to do each operation, like reading metadata, but it seems that is not possible here just yet with the current design.

letFunny

Reviewed another batch of files. I think we could have split this PR into the changes for the format version and removing the DQLITE_NEXT macro and the changes to the vfs2 (maybe it is not that straightforward to do). Definitely not something to do now, but for next PRs it might make reviews faster.

letFunny · 2024-07-03T08:25:08Z

src/raft/uv.h

@@ -47,6 +47,7 @@ typedef unsigned long long uvCounter;
 /* Information persisted in a single metadata file. */
 struct uvMetadata


Is the metadata tied to the uv-based raft_io implementation below or is it agnostic? I assume it is the former because it is in the uv.h file. If that is the case, can we add a comment in the format_version like the one we have below?

/* 1 (original recipe) or 2 (with local data) */

If we need to document the format change further we could define an enum and document it there more thoroughly. The benefit of an enum would be that the checks PRE(1 <= version && version < 3) would automatically be in sync when we add new fields instead of having to change all occurrences.

letFunny · 2024-07-03T08:36:00Z

src/raft/uv.c

-	rv = uvMetadataLoad(uv->dir, &metadata, io->errmsg);
-	if (rv != 0) {
-		return rv;
-	}
-	uv->metadata = metadata;
-


Let me check my understanding here: we are using the disk_mode as a toggle for all of these changes (vfs2 + new format representation) which is why by default uv->format_version = 1 and we change it when calling dqlite_node_enable_disk_mode. And we need to support both implementations for at least some time.

If all of that is true then I guess it makes sense to keep it as is because the sequence has to be init -> enable_disk_mode -> load. Ideally we would pass the format as a parameter from the beginning so that we don't have to think about when the format changes and when to do each operation, like reading metadata, but it seems that is not possible here just yet with the current design.

letFunny · 2024-07-03T08:38:13Z

src/raft/uv.c

 	rv = uv->transport->init(uv->transport, id, address);
 	if (rv != 0) {
 		ErrMsgTransfer(uv->transport->errmsg, io->errmsg, "transport");
 		return rv;
 	}
 	uv->transport->data = uv;

-	rv = uv_timer_init(uv->loop, &uv->timer);


Why is it that we are moving the timer?

letFunny · 2024-07-03T08:47:32Z

src/raft/uv_encoding.c

 {
 	size_t res = 8 + /* Number of entries in the batch, little endian */
 		16 * n; /* One header per entry */;
-	if (with_local_data) {
-#ifdef DQLITE_NEXT
+	if (format_version > 1) {


This might be tricky to find in the future if the format changes in a way that drops local data for version 3 (for example). Do you think it is better to check for the format version explicitly?

letFunny · 2024-07-03T08:48:26Z

src/raft/uv_encoding.c

@@ -143,7 +143,7 @@ static void encodeAppendEntries(const struct raft_append_entries *p, void *buf)
 	bytePut64(&cursor, p->prev_log_term);  /* Previous term. */
 	bytePut64(&cursor, p->leader_commit);  /* Commit index. */

-	uvEncodeBatchHeader(p->entries, p->n_entries, cursor, false /* no local data */);
+	uvEncodeBatchHeader(p->entries, p->n_entries, cursor, 1 /* no local data ever */);


Why no local data ever? I assume it is because this is only used to send AppendEntries messages and we don't transmit local data. Maybe a slightly bigger comment saying something like: encodeAppendEntries is only called when sending AppendEntries messages and local data is never transmitted, is more clear. What do you think?

letFunny · 2024-07-03T08:49:46Z

src/raft/uv_encoding.c

 {
 	unsigned i;
 	void *cursor = buf;

 	/* Number of entries in the batch, little endian */
 	bytePut64(&cursor, n);

-	if (with_local_data) {
-#ifdef DQLITE_NEXT
+	if (format_version > 1) {


letFunny · 2024-07-03T08:50:01Z

src/raft/uv_encoding.c

@@ -391,15 +390,13 @@ int uvDecodeBatchHeader(const void *batch,
 		return 0;
 	}

-	if (local_data_size != NULL) {
-#ifdef DQLITE_NEXT
+	if (format_version > 1) {


letFunny · 2024-07-03T08:51:40Z

src/raft/uv_encoding.c

@@ -456,7 +453,7 @@ static int decodeAppendEntries(const uv_buf_t *buf,
 	args->prev_log_term = byteGet64(&cursor);
 	args->leader_commit = byteGet64(&cursor);

-	rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, false);
+	rv = uvDecodeBatchHeader(cursor, &args->entries, &args->n_entries, NULL, 1 /* no local data ever */);


Same nit about slightly longer documentation.

letFunny · 2024-07-03T08:53:11Z

src/raft/uv_recv.c

@@ -295,7 +295,7 @@ static void uvServerReadCb(uv_stream_t *stream,
 					    s->message.append_entries.entries,
 					    s->message.append_entries
 						.n_entries,
-					    false);
+					    0, 1 /* no local data ever */);


Same nit, although in this case it might be easier to see because the function name already states that it is "receiving" something. Might still be useful for future readers though.

letFunny · 2024-07-03T08:55:52Z

src/raft/uv_segment.c

@@ -405,7 +405,7 @@ int uvSegmentLoadClosed(struct uv *uv,
 	if (rv != 0) {
 		goto err;
 	}
-	if (format != UV__DISK_FORMAT) {
+	if (format != (uint64_t)uv->format_version) {


Nit: maybe be super defensive with uv->format_version < 0 || format != ... (just for the extra security going from int to uint) or even a PRE. This applies to several places but it is very minor so feel free to ignore it.

cole-miller · 2024-07-11T12:29:24Z

Closing in favor of more focused PRs.

cole-miller added 18 commits June 21, 2024 14:41

format version changes

6efb06c

Signed-off-by: Cole Miller <[email protected]>

Removing references to DQLITE_NEXT; format_version

f41e053

Signed-off-by: Cole Miller <[email protected]>

removing the rest of DQLITE_NEXT

3d8cf3b

Signed-off-by: Cole Miller <[email protected]>

remove DQLITE_NEXT from build system

dbe5914

Signed-off-by: Cole Miller <[email protected]>

initialization changes

caf7a00

Signed-off-by: Cole Miller <[email protected]>

fsm changes

2344e7c

Signed-off-by: Cole Miller <[email protected]>

vfs2 fixes

fc9fb8b

Signed-off-by: Cole Miller <[email protected]>

further raft changes

21c66d2

Signed-off-by: Cole Miller <[email protected]>

leader changes

40a22af

Signed-off-by: Cole Miller <[email protected]>

raft header changes

2b4bf32

Signed-off-by: Cole Miller <[email protected]>

server header changes for initialization

36d2bba

Signed-off-by: Cole Miller <[email protected]>

utils

7d14991

Signed-off-by: Cole Miller <[email protected]>

gateway

1328c5f

Signed-off-by: Cole Miller <[email protected]>

test fixes

0aa9cbc

Signed-off-by: Cole Miller <[email protected]>

temporary vfs2 fix

1854700

Signed-off-by: Cole Miller <[email protected]>

test fixes

9ba88fb

Signed-off-by: Cole Miller <[email protected]>

Update many tests that expect working snapshots in disk mode

3d3fb62

Signed-off-by: Cole Miller <[email protected]>

Temporarily skip a test that requires vfs2 digging

f355066

Signed-off-by: Cole Miller <[email protected]>

cole-miller added 2 commits June 25, 2024 09:49

Remove stray debugging

5f72382

Signed-off-by: Cole Miller <[email protected]>

Guard initial_barrier_cb

86ef5cd

Signed-off-by: Cole Miller <[email protected]>

letFunny reviewed Jul 1, 2024

View reviewed changes

letFunny reviewed Jul 3, 2024

View reviewed changes

cole-miller mentioned this pull request Jul 11, 2024

Remove DQLITE_NEXT guard and unify with disk mode #671

Closed

cole-miller closed this Jul 11, 2024

		/* TODO maybe vfs2 should just accept the pages and page numbers
		* in the layout that we receive them over the wire? */

		@@ -47,6 +47,7 @@ typedef unsigned long long uvCounter;
		/* Information persisted in a single metadata file. */
		struct uvMetadata

		/**
		*/

Integrate vfs2 #661

Integrate vfs2 #661

Conversation

cole-miller commented Jun 25, 2024

codecov bot commented Jun 25, 2024 • edited Loading

Codecov Report

cole-miller commented Jun 25, 2024

letFunny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cole-miller Jul 1, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

letFunny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cole-miller commented Jul 11, 2024

codecov bot commented Jun 25, 2024 •

edited

Loading

cole-miller Jul 1, 2024 •

edited

Loading