Refactor leader.c to fix stack growth in handle_exec_sql #697
Signed-off-by: Cole Miller <[email protected]>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #697 +/- ##
==========================================
- Coverage 81.07% 80.95% -0.12%
==========================================
Files 197 196 -1
Lines 29164 29186 +22
Branches 4066 4089 +23
==========================================
- Hits 23644 23627 -17
- Misses 3875 3895 +20
- Partials 1645 1664 +19
Jepsen run at https://github.com/canonical/jepsen.dqlite/actions/runs/10672883562, looks clean.
Status: trying to identify and fix the regression seen here canonical/lxd#14034, namely
6fd1b60 to 63e0ede
I believe the LXD issue has been fixed; the problem was returning 0 from
Looks good to me, thanks Cole. I just have a few questions to improve my understanding and some very small suggestions.
Testing again locally revealed that (contrary to what I remembered) 1000 isn't enough to cause a stack overflow with the old implementation on my machine.
Signed-off-by: Cole Miller <[email protected]>
I don't remember why I thought there was a problem here originally, but investigation with printf shows that the second flush is happening as intended.
Signed-off-by: Cole Miller <[email protected]>
This PR is another attempt to fix the stack growth issue with handle_exec_sql (related to #679), this time in a more principled way that involves a significant refactor of leader.c. Details follow.
Recall that the issue arises with a call stack like this (callees at the top)
that is, indirect recursion that can generate a number of stack frames proportional to the number of `;`-separated statements in an `EXEC_SQL` request. This happens because leader__barrier and leader__exec can invoke their callbacks synchronously when suspending is not required (respectively, because the FSM is up to date with the raft log and because `sqlite3_step` generated no changed pages).

This PR rewrites those two functions, establishing a new contract: the callback is only invoked if we yielded to the event loop at least once while processing the request; if the request was processed synchronously, a magic value `LEADER_NOT_ASYNC` is returned to indicate that the caller should invoke the callback. Existing callsites are updated to reflect this new contract.

With these new implementations available, we can fix the original problem by rewriting `handle_exec_sql_next` to iteratively process as many statements as possible until one of them suspends or returns an error. The PR includes a regression test to validate this fix.

The new implementations of the leader.c functions are intentionally different in style from the old ones. Effectively, each barrier request and exec request is a coroutine driven by a state machine. With this approach the `LEADER_NOT_ASYNC` feature falls out rather naturally, and we get the added benefits of more compact code and built-in observability.
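
To make the contract described above concrete, here is a minimal, self-contained sketch in C. It is not the code from this PR: every name in it (`sketch_req`, `sketch_exec`, `sketch_run`, `sketch_resume`, `SKETCH_NOT_ASYNC` and its numeric value) is hypothetical, and the event loop is simulated by calling `sketch_resume` by hand. It only illustrates the idea that a callee either finishes synchronously and returns a sentinel without touching the callback, or suspends and has the callback invoked later, so the caller can drive the statements from a loop instead of through recursion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the PR's LEADER_NOT_ASYNC magic value;
 * the real value used in leader.c is not assumed here. */
#define SKETCH_NOT_ASYNC (-2)
#define SKETCH_ERROR 1

struct sketch_req;
typedef void (*sketch_cb)(struct sketch_req *req, int status);

struct sketch_req {
	size_t next;       /* index of the next statement to run */
	size_t total;      /* number of statements in the request */
	size_t suspend_at; /* statement that must suspend (>= total: none) */
	size_t fail_at;    /* statement that fails (>= total: none) */
	sketch_cb pending; /* callback parked while a statement is suspended */
	int done;          /* set once the request has finished */
	int status;        /* final status, valid once done is set */
};

/* Run statement i under the assumed contract:
 *  - SKETCH_NOT_ASYNC: finished synchronously, cb was NOT invoked;
 *  - 0: suspended, cb will be invoked later from the event loop;
 *  - anything else: error, cb was NOT invoked. */
static int sketch_exec(struct sketch_req *req, size_t i, sketch_cb cb)
{
	if (i == req->fail_at) {
		return SKETCH_ERROR;
	}
	if (i == req->suspend_at) {
		req->pending = cb;
		return 0;
	}
	return SKETCH_NOT_ASYNC;
}

void sketch_run(struct sketch_req *req);

/* Invoked only after a suspension, from the (simulated) event loop, so
 * re-entering the loop here starts from a shallow stack. */
static void sketch_exec_done(struct sketch_req *req, int status)
{
	if (status != 0) {
		req->status = status;
		req->done = 1;
		return;
	}
	sketch_run(req);
}

/* Iteratively process statements until one suspends or fails. */
void sketch_run(struct sketch_req *req)
{
	while (req->next < req->total) {
		int rv = sketch_exec(req, req->next, sketch_exec_done);
		req->next++;
		if (rv == 0) {
			return; /* suspended; sketch_exec_done resumes us */
		}
		if (rv != SKETCH_NOT_ASYNC) {
			req->status = rv;
			req->done = 1;
			return;
		}
	}
	req->status = 0;
	req->done = 1;
}

/* Simulated event loop tick: complete the suspended statement. */
void sketch_resume(struct sketch_req *req)
{
	sketch_cb cb = req->pending;
	assert(cb != NULL);
	req->pending = NULL;
	cb(req, 0);
}
```

In this sketch, the stack depth inside `sketch_run` does not depend on how many statements complete synchronously, because each one is handled by another iteration of the `while` loop rather than by a nested callback invocation. A caller would set `total`, `suspend_at` and `fail_at`, call `sketch_run` once, and then call `sketch_resume` whenever `pending` is set and `done` is not.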