Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-8687 Track vassert messages for crash log #25051

Open
wants to merge 10 commits into
base: dev
Choose a base branch
from

Conversation

michael-redpanda
Copy link
Contributor

@michael-redpanda michael-redpanda commented Feb 6, 2025

This pull request adds the capability to the crash tracker to log out aborts caused by vassert.

  • vassert messages are now captured in a thread local storage instance that will be accessed by
    the SIGABRT signal handler. This message will be written out to the generated crash report
  • The vassert macro will now call abort() rather than __builtin_trap(). This is because the implementation of __builtin_trap() is implementation dependent and may or may not raise a SIGILL signal.
  • New debug mode only endpoints are added for testing purposes to set the value contained by the assert message storage class for testing

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • None

@michael-redpanda michael-redpanda requested review from pgellert and a team February 6, 2025 01:47
@michael-redpanda michael-redpanda self-assigned this Feb 6, 2025
@michael-redpanda michael-redpanda requested a review from a team as a code owner February 6, 2025 01:48
@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch 3 times, most recently from 5b24f93 to cf31a6c Compare February 6, 2025 01:52
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 6, 2025

Retry command for Build#61645

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/storage_failure_injection_test.py::StorageFailureInjectionTest.test_storage_failure_injection

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 6, 2025

CI test results

test results on build#61645
test_id test_kind job_url test_status passed
assert_log_holder_test_rpunit.assert_log_holder_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61645#0194d8f6-2372-4a0e-ba8d-84c0f118a065 FAIL 0/2
assert_log_holder_test_rpunit.assert_log_holder_test_rpunit unit https://buildkite.com/redpanda/redpanda/builds/61645#0194d8f6-2373-411c-a317-0909013d80f2 FAIL 0/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b74-4708-9428-2495e0f5876f FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b74-4708-9428-2495e0f5876f FLAKY 1/2
rptest.tests.storage_failure_injection_test.StorageFailureInjectionTest.test_storage_failure_injection ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d938-77ec-4564-9024-c364ca23d33d FAIL 0/20
rptest.tests.storage_failure_injection_test.StorageFailureInjectionTest.test_storage_failure_injection ducktape https://buildkite.com/redpanda/redpanda/builds/61645#0194d954-7b75-4de3-aa6f-724aba69073c FAIL 0/20
test results on build#61689
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd28-3869-4b8b-984d-460f70d5cf39 FLAKY 1/3
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abf-4d74-967d-9ca4d910b390 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd28-3869-483d-ab3e-70c2c8beb87d FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-4b2c-91f4-f3b5f295937e FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-47b1-af84-366e29982458 FLAKY 1/4
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/61689#0194dd2c-2abd-4b2c-91f4-f3b5f295937e FLAKY 1/2
test results on build#61730
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6369-4853-8da4-fbf4865b1ff5 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e0f7-5783-4198-8ed7-90516a7ec14a FLAKY 1/2
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6368-4a7b-8793-07cb1d10d25b FLAKY 1/5
rptest.tests.datalake.compaction_test.CompactionGapsTest.test_translation_no_gaps.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_JDBC ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e113-6368-454a-a30e-7d261e447951 FLAKY 1/3
rptest.tests.full_node_recovery_test.FullNodeRecoveryTest.test_node_recovery.recovery_type=full ducktape https://buildkite.com/redpanda/redpanda/builds/61730#0194e0f7-5781-4532-9b56-54f6ec3a1fb5 FLAKY 1/2
test results on build#62047
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952179-6a2e-49b4-91f7-48c7b235566d FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952179-6a2e-452d-817c-a301be6fb2bf FLAKY 1/2
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952176-37b9-4fac-bd43-57f313408e39 FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952179-6a2e-49b4-91f7-48c7b235566d FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=3 ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952179-6a2c-4d84-83f3-6c7f536ded3e FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62047#01952179-6a2c-4d84-83f3-6c7f536ded3e FLAKY 1/2
test results on build#62064
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62064#01952452-47ba-4727-b75e-05f4d45fa5ef FLAKY 1/2
rptest.tests.log_compaction_test.LogCompactionTest.compaction_stress_test.cleanup_policy=compact.delete.key_set_cardinality=1000.storage_compaction_key_map_memory_kb=10 ducktape https://buildkite.com/redpanda/redpanda/builds/62064#01952456-2988-4737-b7f9-d2197ba85f0a FLAKY 1/2
rptest.tests.scaling_up_test.ScalingUpTest.test_scaling_up_with_recovered_topic ducktape https://buildkite.com/redpanda/redpanda/builds/62064#01952456-2988-4737-b7f9-d2197ba85f0a FLAKY 1/3
test results on build#62072
test_id test_kind job_url test_status passed
rptest.tests.cloud_storage_timing_stress_test.CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves.cleanup_policy=compact.delete ducktape https://buildkite.com/redpanda/redpanda/builds/62072#01952550-b74a-4157-a0bd-e920eb7e2fe2 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/62072#01952550-b74b-4d46-9761-22b5b8232cd5 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryTest.test_index_recovery ducktape https://buildkite.com/redpanda/redpanda/builds/62072#01952565-8f41-4be8-8660-2849b8a6da60 FLAKY 1/2
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.catalog_type=CatalogType.REST_HADOOP ducktape https://buildkite.com/redpanda/redpanda/builds/62072#01952565-8f41-4998-ad5a-af04c7e10628 FLAKY 1/2
rptest.tests.log_segment_ms_test.SegmentMsTest.test_segment_rolling.use_alter_cfg=False.server_cfg=300000.topic_cfg=None ducktape https://buildkite.com/redpanda/redpanda/builds/62072#01952565-8f41-4be8-8660-2849b8a6da60 FLAKY 1/2
test results on build#62104
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62104#019528d7-7bc4-4914-ba93-9ca1bcd7b45c FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_shadow_indexing.num_to_upgrade=0.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62104#019528db-5a03-405e-b307-9c7850d03e0b FLAKY 1/2

Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, mostly just questions from me. (+ fyi the new unit tests are failing)

@@ -678,6 +678,82 @@
]
}
]
},
{
"path": "/v1/debug/store_message/{shard}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Docs Team reviewers: These routes are only for testing and should not be documented.

@michael-redpanda michael-redpanda requested review from a team, BenPope and IoannisRP and removed request for a team February 6, 2025 19:16
@michael-redpanda
Copy link
Contributor Author

(+ fyi the new unit tests are failing)

Yeah the perils of having two build systems. Forgot GTEST in rp_unit

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from cf31a6c to 6bd5f0c Compare February 6, 2025 19:44
@michael-redpanda
Copy link
Contributor Author

Force push 6bd5f0c:

  • Added use of the assert storage class to unreachable()
  • Fixed issue with tests CMake file
  • Adjusted failure injection log allow list

@@ -99,6 +99,6 @@ static thread_local assert_log_holder g_assert_log_holder;
__LINE__, \
#x, \
##args); \
__builtin_trap(); \
abort(); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I might suggest std::abort() to avoid any potential for the wrong symbol being used here since this file is included pretty much everywhere in redpanda.

  • are we at any risk for the same deadlock problems as glibc 2.26 (listed in the notes on the man page) related to the closing/flushing of IO streams, particularly since we are going to be doing I/O in response to sigabort?

  • one of the original motivations behind builtin_trap was that it minimized the amount of code that was inlined into a vassert usage. presumably this isn't really an issue (at least in godbolt i see just a call instruction), but just noting that here for completeness.

The only downside of this switch is that now we will end up printing a stacktrace twice on vassert crashes (once in the vassert macro and once in seastar's SIGABRT handler).

presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presumably this wouldn't be an issue if we hooked into Seastar's signal handling hooks? Then we could keep builtin_trap?

Our issue with builtin_trap is that what it does is implementation dependent. In I think most cases it raises a SIGILL but not all. We could change it to call raise(SIGILL) if we wish to maintain a similar behavior and get around any possible deadlocking issues caused by abort.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me rephrase. is there a benefit to

vassert():
logging
signal

signal_handler():
crash reporting

vs

vassert():
logging
crash reporting
builtin_trap

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

talked with @pgellert about this and we believe the second approach is preferrable. Less work in the handler is better, and avoids any annoying issues with memory of I/O getting messed up in the signal handler.

Comment on lines 505 to 513
if constexpr (admin_server::is_store_message_enabled()) {
register_route<superuser>(
ss::httpd::debug_json::put_store_message,
[this](std::unique_ptr<ss::http::request> req) {
return put_store_message(std::move(req));
});
register_route<superuser>(
ss::httpd::debug_json::get_store_message,
[this](std::unique_ptr<ss::http::request> req) {
return get_store_message(std::move(req));
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think its worth debating if these are actually compiled out vs compiled in but disabled. for example compiling them in, and requiring some test-only runtime switch to make them active woudl mean we could test in both debug and release modes...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took inspiration from the failure injector stuff. I agree that doing a runtime switch is preferable (we kind of do that with the RP_FIXTURE_TEST environment variable for the OpenSSL Context stuff in unit tests). Happy to discuss a path forward on this and adapting it to other things like failure injector.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't matter to me one way or the other

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's continue with this approach for now and get some eyes/consensus on 9051. Once that's settled we'll come back and adjust.

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from 6bd5f0c to 432d396 Compare February 7, 2025 13:53
@michael-redpanda
Copy link
Contributor Author

Force push 432d396:

  • Rebase off dev to fix conflicts. No changes

// lock-free atomic is UB.
static thread_local inline std::atomic<const char*> _abuffer{nullptr};
};
static thread_local assert_log_holder g_assert_log_holder;
Copy link
Member

@dotnwat dotnwat Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm concerned this isn't going to do what you want (or that i'm mistaken about what the exact goal is). here you're going to i think get a separate instance of assert_log_holder in each translation unit (static foo in a header--very rare what is intended), then those would all contain another thread_local _abuffer that points to a single instance (since its inline)? it's a bit of brain twister.

would this be simpler:

class assert_log_holder {
  std::atomic...;
};

inline thread_local assert_log_holder g_assert_log_holder

here you get a single global instance of assert_log_holder per thread.

Copy link
Contributor Author

@michael-redpanda michael-redpanda Feb 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is probably going to go away with the refactor to write out the file before raising the signal... this was only here so that the signal handle could access the vassert message

I re-read what you wrote and you're correct, the static declaration was wrong here. The thread_local is also not going to be necessary moving forward with the new implementation. Thanks for the call out.

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from 432d396 to a0b17db Compare February 20, 2025 02:25
@michael-redpanda
Copy link
Contributor Author

Note for reviewers: Need to reword some of the commits still, but wanted to get the changes up

Force push a0b17db:

  • Fixed license in assert_log_holder_test.cc
  • Major change: No longer deferring write to signal handler - writing to crash tracker file will now happen inline with vassert
    • Added a callback function to register with assert holder that will be used when message is formatted
    • Updated crash tracker to register a callback and write out message to crash tracker file
  • Updated admin endpoints - no longer put/store message, now just a single put to trigger crash tracker file to write due to vassert (doesn't actually call vassert)
  • No longer using abort() in unreachable or vassert - back to __builtin_trap()
  • Updated tests

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from a0b17db to acd89f3 Compare February 20, 2025 02:28
@michael-redpanda
Copy link
Contributor Author

Force push a0b17db:

  • Fixed lint error in python

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from acd89f3 to 764a2f0 Compare February 20, 2025 15:48
@michael-redpanda
Copy link
Contributor Author

Force push 764a2f0:

  • No code changes - just commit rewording

Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from 764a2f0 to 783a803 Compare February 20, 2025 20:24
@michael-redpanda
Copy link
Contributor Author

Force push 783a803:

  • Fixed up commit messages to remove references to SIGABRT (no longer using abort() in vassert/unreachable)
  • Switched DT test to use SIGILL rather than SIGABRT
  • Removed unecessary failure log messages in redpanda.py
  • Hardened callback mechanism to protect against a race condition that may occur if one thread is registering a callback while another is attempting to register an event

Comment on lines +68 to +69
assert_cb_func before = nullptr;
_cb_func.compare_exchange_strong(before, cb);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want first writer wins, or last writer wins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If by "writer" you mean whoever calls register_cb, I'd say probably first... but in practice there's only one spot where this is going to be called (from recorder.cc)

Comment on lines 63 to 64
g_assert_log.l.error("{}", buffer);
g_assert_log.l.error("Backtrace:\n{}", bt);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the log lines take precedence over the callback for example in the case the callback has issues given the state of the system?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a very good point, thanks.

@@ -1029,3 +1032,43 @@ ss::future<ss::json::json_return_type> admin_server::override_node_uuid_handler(

co_return ss::json::json_return_type(ss::json::json_void());
}

ss::future<std::unique_ptr<ss::http::reply>> admin_server::put_ctracker_va(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should it go into the finjector url namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless i'm completely blind I don't see a finjector namespace. There is failure-probes but that's all honey badger related, including the root /v1/failure-probes endpoint.

Created a structure that will be used to hold a callback function that
will be called when an assert is fired.  The desire will be for the
crash tracker module to register a callback that will trigger creation
of a crash tracker file containing the vassert message and stack trace.

A call back was used rather than directly calling into the crash tracker
from assert in order to not make crash tracker a global depenency, as
vassert is today.

Signed-off-by: Michael Boquard <[email protected]>
* Added assertions crash_type
* Registered callback with the assert message holder
* Callback will write out assert message to crash tracker file

Signed-off-by: Michael Boquard <[email protected]>
New route in `/v1/debug/ctracker/va/{shard}` that permits setting an
assertion message in crash tracker (without calling vassert) from a
specified shard.

Signed-off-by: Michael Boquard <[email protected]>
Implements handler that will trigger a crash tracker message creation
caused by vassert (without calling vassert).  Will verify that {shard}
is correct and take in a message of type {"message": <message>}.

Signed-off-by: Michael Boquard <[email protected]>
This endpoint is only available in DEBUG builds of Redpanda.

Signed-off-by: Michael Boquard <[email protected]>
Changes necessary due to changes in how vassert messaging is
constructed.

Signed-off-by: Michael Boquard <[email protected]>
@michael-redpanda michael-redpanda force-pushed the startup-log/core-8687/vassert branch from 783a803 to 3b13263 Compare February 21, 2025 12:53
@michael-redpanda
Copy link
Contributor Author

Force push 3b13263:

  • Moved logging of assert message to before callback

Copy link
Contributor

@pgellert pgellert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. (There's just one remaining unused header in vassert.h. I am happy to clean that up in one of my follow up PRs.)

@@ -16,12 +16,63 @@

#include <seastar/util/backtrace.hh>
#include <seastar/util/log.hh>
#include <seastar/util/noncopyable_function.hh>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
#include <seastar/util/noncopyable_function.hh>

(I can carry this over to a follow up PR.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants