
Return non-zero exit code for non-graceful shutdown #8

Merged

Conversation


@alexander-jiang alexander-jiang commented Jan 8, 2025

When testing in staging with a broken pubsub sink (and the healthcheck disabled so that Vector will start), the Vector process returns an exit code of 0, even though it fails to shut down the sink component within the graceful shutdown time limit.

This change updates the behavior so that Vector shuts down with a non-zero exit code, which we can use later to determine whether Vector shut down gracefully, and therefore whether or not to run buffer recovery.

Test cases: see the PR comment below.

src/topology/running.rs
// Panic to ensure that Vector returns a non-zero exit code when it's unable to gracefully shutdown
futures::future::join(source_shutdown_complete, shutdown_complete_future)
    .map(|futures| match futures.1.0 {
        Result::Err(_) => panic!("failed to gracefully shutdown in time, panic to force non-zero exit code"),


Is there a way to avoid a panic! here without changing the interface? Given this seems to be returning a Future, can we wrap a Result in a future instead and unwrap that in the exit handlers?
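As a sketch of that idea (hypothetical names, not the actual Vector interface): surface the shutdown outcome as a Result and map it to an exit code at the top level, rather than panicking inside the future:

```rust
// Hypothetical sketch, not the actual Vector interface: instead of panicking
// inside the shutdown future, surface a Result and map it to an exit code in
// the caller.
fn exit_code_for(shutdown: Result<(), String>) -> i32 {
    match shutdown {
        Ok(()) => 0,
        Err(reason) => {
            eprintln!("non-graceful shutdown: {reason}");
            // 101 mirrors the exit code a Rust panic produces, so any
            // downstream tooling keyed on that value keeps working
            101
        }
    }
}

fn main() {
    let code = exit_code_for(Err("components still running at deadline".into()));
    // A real binary would end with std::process::exit(code); here we just print it.
    println!("{code}");
}
```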


Hmm, I guess the output future is bound to contain a unit type here.


@alexander-jiang what do you think of calling panic! after you collect all the values, instead of as soon as you see the first error? This would let all the components shut down before we engage the panic trap to crash out when we detect a bad shutdown.

Also, we should tag the team so this gets reviewed more widely.


@alexander-jiang alexander-jiang Jan 8, 2025


My understanding is that future::join waits for both futures (source_shutdown_complete and shutdown_complete_future) to complete before returning, so we only panic after both of those futures have completed.

The shutdown_complete_future uses a future::select_all over its futures, which waits until any one of them is ready. In that case, though, I think it is correct not to wait for all of them: the select_all completes either when the deadline is exceeded (the timeout future) or when all the tasks are done (the success future); the reporter future is an infinite loop that writes a log message/event.


Small correction, I think: the select_all handler in https://github.com/discord/vector/pull/8/files#diff-bdafd184965fb9c8ceeb9213965a2d5dbe43037d8f5f3501f3d0fb9d11b4cc8fR197 waits for the first of three events: the timeout, success, or the reporting (currently running) components future returning.

Which makes me think we should also treat the timeout scenario as erroneous, even if some component did not have a reporting loop for completion? 💭 Though, from my reading, this scenario is unlikely to actually manifest.
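To illustrate the select_all semantics under discussion (first-ready wins, remaining futures are dropped), here is a thread-based analogue using only the standard library; the task names and delays are made up for illustration:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Thread-based analogue of the select_all race described above: spawn the
// three "futures" (timeout, success, reporter) and take whichever finishes
// first; the rest are ignored.
fn first_ready(tasks: &[(&'static str, u64)]) -> &'static str {
    let (tx, rx) = mpsc::channel();
    for &(name, delay_ms) in tasks {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms));
            let _ = tx.send(name);
        });
    }
    // Only the first completion matters.
    rx.recv().unwrap()
}

fn main() {
    // "success" has the shortest delay, so it wins the race.
    let winner = first_ready(&[("timeout", 200), ("success", 10), ("reporter", 400)]);
    println!("first ready: {winner}");
}
```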


alexander-jiang commented Jan 8, 2025

Testing with an intentionally broken pubsub sink component and then sending a SIGTERM to the Vector process directly (using kill -s SIGTERM {vector_pid} on the pod):

(Note: the panic message below, alexj's panic: failed to gracefully shutdown in time, is different from the panic message in the latest version of the PR; I updated it after running this test.)

{"timestamp":"2025-01-08T19:44:02.730800Z","level":"INFO","message":"Signal received.","signal":"SIGTERM","target":"vector::signal"}
{"timestamp":"2025-01-08T19:44:02.730854Z","level":"INFO","message":"Vector has stopped.","target":"vector"}
[...]
{"timestamp":"2025-01-08T19:44:07.731863Z","level":"INFO","message":"Shutting down... Waiting on running components.","remaining_components":"\"event_collector_sender\"","time_remaining":"\"9 seconds left\"","target":"vector::topology::running"}
{"timestamp":"2025-01-08T19:44:12.731662Z","level":"INFO","message":"Shutting down... Waiting on running components.","remaining_components":"\"event_collector_sender\"","time_remaining":"\"4 seconds left\"","target":"vector::topology::running"}
[...]
thread 'main' panicked at src/topology/running.rs:208:35:
alexj's panic: failed to gracefully shutdown in time
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{"timestamp":"2025-01-08T19:44:17.731648Z","level":"ERROR","message":"Failed to gracefully shut down in time. Killing components.","components":"\"event_collector_sender\"","target":"vector::topology::running"}
2025-01-08 19:44:17,742 WARN exited: vector (exit status 101; not expected)
vector exited with non-zero code = 101: vector did not shut down gracefully
Removing buffer.lock file
Able to reach GCS bucket vector-stg
Initial buffer directory size = 445999928
starting buffer recovery...
buffer directory size decreased, current size = 311789016
buffer directory size decreased, current size = 177575120
buffer directory size decreased, current size = 43367360
Terminating buffer recovery vector process...
2025-01-08 19:44:39,118 WARN received SIGTERM indicating exit request
2025-01-08 19:44:39,118 INFO waiting for buffer_recovery to die
removing vector socket /tmp/vector.sock
sending sigterm to parent PID 7
2025-01-08 19:44:39,155 INFO exited: buffer_recovery (exit status 0; expected)

@alexander-jiang

Some notes on the vector code that I'm patching here:
The RunningTopology.stop() function (in src/topology/running.rs) is called by TopologyController.stop() (in src/topology/controller.rs), which in turn is called by FinishedApplication.stop(). I believe this last call is what runs when the Vector process receives a shutdown signal (see the os_signals function in src/signal.rs: SIGTERM maps to SignalTo::Shutdown).
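A minimal sketch of that signal-to-shutdown mapping (simplified names and signature; the real enum and signal handling live in src/signal.rs):

```rust
// Simplified sketch of the dispatch described above; the real SignalTo enum
// in src/signal.rs has more variants, and signals arrive via a handler, not
// as strings.
#[derive(Debug, PartialEq)]
enum SignalTo {
    Shutdown,
    Quit,
}

fn map_signal(signal: &str) -> Option<SignalTo> {
    match signal {
        // SIGTERM (and SIGINT) request a graceful shutdown
        "SIGTERM" | "SIGINT" => Some(SignalTo::Shutdown),
        // SIGQUIT requests an immediate quit
        "SIGQUIT" => Some(SignalTo::Quit),
        _ => None,
    }
}

fn main() {
    println!("{:?}", map_signal("SIGTERM").unwrap());
}
```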

@avinashak avinashak self-requested a review January 9, 2025 18:27

@avinashak avinashak left a comment


Requesting a few test scenarios be run before merging this:

  • add two broken sinks and ensure both run until the timeout
  • run the broken-sink scenario multiple times to check for potential cases where the components don't wait for completion
  • run a mix of broken and healthy sink configurations to ensure the shutdown handling covers all components

@alexander-jiang alexander-jiang force-pushed the panic-for-non-graceful-shutdown branch from e9f720f to 2fe231a Compare January 9, 2025 19:50

alexander-jiang commented Jan 14, 2025

Update - I tested again:

  • with the happy path (no broken sink configs), when I sent a SIGTERM to the vector process pid directly (i.e. kill -s SIGTERM {pid}), the exit code was 0 (as expected)
  • with the unhappy path (one broken pubsub sink config: invalid topic name), when I sent a SIGTERM to the vector process pid directly, the exit code was not 0 (it was 101, due to the panic), as expected
  • with the happy path, when I killed the pod (kubectl delete pod {pod_name}), the exit code was unknown (the vector_start.sh script did not get a chance to write the exit code)
  • with the unhappy path, where I added one broken pubsub sink config but kept one healthy pubsub sink config, when I sent the SIGTERM to the vector process pid, the exit code was not 0 (as expected)

@alexander-jiang alexander-jiang merged commit 2a7cc57 into data_infra_vector_stable Jan 14, 2025
44 of 48 checks passed

alexander-jiang commented Jan 14, 2025

I didn't squash the commits correctly when merging, so I did a manual rebase and force-push on the remote data_infra_vector_stable branch.

@alexander-jiang alexander-jiang deleted the panic-for-non-graceful-shutdown branch January 14, 2025 20:38