-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-8687 Track vassert messages for crash log #25051
base: dev
Are you sure you want to change the base?
CORE-8687 Track vassert messages for crash log #25051
Conversation
5b24f93
to
cf31a6c
Compare
Retry command for Build#61645please wait until all jobs are finished before running the slash command
|
CI test resultstest results on build#61645
test results on build#61689
test results on build#61730
test results on build#62047
test results on build#62064
test results on build#62072
test results on build#62104
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, mostly just questions from me. (+ fyi the new unit tests are failing)
@@ -678,6 +678,82 @@ | |||
] | |||
} | |||
] | |||
}, | |||
{ | |||
"path": "/v1/debug/store_message/{shard}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Docs Team reviewers: These routes are only for testing and should not be documented.
Yeah the perils of having two build systems. Forgot |
cf31a6c
to
6bd5f0c
Compare
Force push 6bd5f0c:
|
src/v/base/vassert.h
Outdated
@@ -99,6 +99,6 @@ static thread_local assert_log_holder g_assert_log_holder; | |||
__LINE__, \ | |||
#x, \ | |||
##args); \ | |||
__builtin_trap(); \ | |||
abort(); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
I might suggest
std::abort()
to avoid any potential for the wrong symbol being used here since this file is included pretty much everywhere in redpanda. -
are we at any risk for the same deadlock problems as glibc 2.26 (listed in the notes on the man page) related to the closing/flushing of IO streams, particularly since we are going to be doing I/O in response to sigabort?
-
one of the original motivations behind builtin_trap was that it minimized the amount of code that was inlined into a vassert usage. presumably this isn't really an issue (at least in godbolt i see just a
call
instruction), but just noting that here for completeness.
The only downside of this switch is that now we will end up printing a stacktrace twice on vassert crashes (once in the vassert macro and once in seastar's SIGABRT handler).
presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?
Our issue with builtin_trap is that what it does is implementation dependent. In I think most cases it raises a SIGILL
but not all. We could change it to call raise(SIGILL)
if we wish to maintain a similar behavior and get around any possible deadlocking issues caused by abort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me rephrase. is there a benefit to
vassert():
logging
signal
signal_handler():
crash reporting
vs
vassert():
logging
crash reporting
builtin_trap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
talked with @pgellert about this and we believe the second approach is preferrable. Less work in the handler is better, and avoids any annoying issues with memory of I/O getting messed up in the signal handler.
src/v/redpanda/admin/debug.cc
Outdated
if constexpr (admin_server::is_store_message_enabled()) { | ||
register_route<superuser>( | ||
ss::httpd::debug_json::put_store_message, | ||
[this](std::unique_ptr<ss::http::request> req) { | ||
return put_store_message(std::move(req)); | ||
}); | ||
register_route<superuser>( | ||
ss::httpd::debug_json::get_store_message, | ||
[this](std::unique_ptr<ss::http::request> req) { | ||
return get_store_message(std::move(req)); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think its worth debating if these are actually compiled out vs compiled in but disabled. for example compiling them in, and requiring some test-only runtime switch to make them active woudl mean we could test in both debug and release modes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took inspiration from the failure injector stuff. I agree that doing a runtime switch is preferable (we kind of do that with the RP_FIXTURE_TEST
environment variable for the OpenSSL Context stuff in unit tests). Happy to discuss a path forward on this and adapting it to other things like failure injector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't matter to me one way or the other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's continue with this approach for now and get some eyes/consensus on 9051. Once that's settled we'll come back and adjust.
6bd5f0c
to
432d396
Compare
Force push 432d396:
|
src/v/base/vassert.h
Outdated
// lock-free atomic is UB. | ||
static thread_local inline std::atomic<const char*> _abuffer{nullptr}; | ||
}; | ||
static thread_local assert_log_holder g_assert_log_holder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm concerned this isn't going to do what you want (or that i'm mistaken about what the exact goal is). here you're going to i think get a separate instance of assert_log_holder
in each translation unit (static foo
in a header--very rare what is intended), then those would all contain another thread_local _abuffer
that points to a single instance (since its inline)? it's a bit of brain twister.
would this be simpler:
class assert_log_holder {
std::atomic...;
};
inline thread_local assert_log_holder g_assert_log_holder
here you get a single global instance of assert_log_holder
per thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably going to go away with the refactor to write out the file before raising the signal... this was only here so that the signal handle could access the vassert message
I re-read what you wrote and you're correct, the static
declaration was wrong here. The thread_local
is also not going to be necessary moving forward with the new implementation. Thanks for the call out.
432d396
to
a0b17db
Compare
Note for reviewers: Need to reword some of the commits still, but wanted to get the changes up Force push a0b17db:
|
a0b17db
to
acd89f3
Compare
Force push a0b17db:
|
acd89f3
to
764a2f0
Compare
Force push 764a2f0:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great
764a2f0
to
783a803
Compare
Force push 783a803:
|
assert_cb_func before = nullptr; | ||
_cb_func.compare_exchange_strong(before, cb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want first writer wins, or last writer wins?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If by "writer" you mean whoever calls register_cb
, I'd say probably first... but in practice there's only one spot where this is going to be called (from recorder.cc)
src/v/base/vassert.h
Outdated
g_assert_log.l.error("{}", buffer); | ||
g_assert_log.l.error("Backtrace:\n{}", bt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the log lines take precedence over the callback for example in the case the callback has issues given the state of the system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a very good point, thanks.
@@ -1029,3 +1032,43 @@ ss::future<ss::json::json_return_type> admin_server::override_node_uuid_handler( | |||
|
|||
co_return ss::json::json_return_type(ss::json::json_void()); | |||
} | |||
|
|||
ss::future<std::unique_ptr<ss::http::reply>> admin_server::put_ctracker_va( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: should it go into the finjector url namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless i'm completely blind I don't see a finjector
namespace. There is failure-probes
but that's all honey badger related, including the root /v1/failure-probes
endpoint.
Created a structure that will be used to hold a callback function that will be called when an assert is fired. The desire will be for the crash tracker module to register a callback that will trigger creation of a crash tracker file containing the vassert message and stack trace. A call back was used rather than directly calling into the crash tracker from assert in order to not make crash tracker a global depenency, as vassert is today. Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
* Added assertions crash_type * Registered callback with the assert message holder * Callback will write out assert message to crash tracker file Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
New route in `/v1/debug/ctracker/va/{shard}` that permits setting an assertion message in crash tracker (without calling vassert) from a specified shard. Signed-off-by: Michael Boquard <[email protected]>
Implements handler that will trigger a crash tracker message creation caused by vassert (without calling vassert). Will verify that {shard} is correct and take in a message of type {"message": <message>}. Signed-off-by: Michael Boquard <[email protected]>
This endpoint is only available in DEBUG builds of Redpanda. Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Signed-off-by: Michael Boquard <[email protected]>
Changes necessary due to changes in how vassert messaging is constructed. Signed-off-by: Michael Boquard <[email protected]>
783a803
to
3b13263
Compare
Force push 3b13263:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. (There's just one remaining unused header in vassert.h. I am happy to clean that up in one of my follow up PRs.)
@@ -16,12 +16,63 @@ | |||
|
|||
#include <seastar/util/backtrace.hh> | |||
#include <seastar/util/log.hh> | |||
#include <seastar/util/noncopyable_function.hh> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
#include <seastar/util/noncopyable_function.hh> |
(I can carry this over to a follow up PR.)
This pull request adds the capability to the crash tracker to log out aborts caused by
vassert
.vassert
messages are now captured in a thread local storage instance that will be accessed bythe
SIGABRT
signal handler. This message will be written out to the generated crash reportvassert
macro will now callabort()
rather than__builtin_trap()
. This is because the implementation of__builtin_trap()
is implementation dependent and may or may not raise aSIGILL
signal.Backports Required
Release Notes