
Return non-zero exit code for non-graceful shutdown #8

Merged

Conversation


@alexander-jiang alexander-jiang commented Jan 8, 2025

When testing in staging with a broken pubsub sink (and the healthcheck disabled so that Vector will start), the Vector process returns an exit code of 0, even though it fails to shut down the sink component within the graceful shutdown time limit.

This change updates the behavior so that Vector shuts down with a non-zero exit code, which we can use later to determine whether Vector shut down gracefully, and therefore whether or not to run buffer recovery.

Test cases: see the PR comment below.

src/topology/running.rs
// Panic to ensure that Vector returns a non-zero exit code when it's unable to gracefully shutdown
futures::future::join(source_shutdown_complete, shutdown_complete_future)
    .map(|futures| match futures.1.0 {
        Result::Err(_) => panic!("failed to gracefully shutdown in time, panic to force non-zero exit code"),


Is there a way to avoid a panic! here without changing the interface? Given this seems to be returning a Future, can we wrap a Result in a future instead and unwrap that in the exit handlers?
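As a sketch of that idea (hypothetical names, not the actual Vector interface): surface the shutdown outcome as a Result and map it to an exit code at the top level, rather than panicking inside the future:

```rust
// Hypothetical sketch, not the actual Vector interface: instead of panicking
// inside the shutdown future, surface a Result and map it to an exit code in
// the caller.
fn exit_code_for(shutdown: Result<(), String>) -> i32 {
    match shutdown {
        Ok(()) => 0,
        Err(reason) => {
            eprintln!("non-graceful shutdown: {reason}");
            // 101 mirrors the exit code a Rust panic produces, so any
            // downstream tooling keyed on that value keeps working
            101
        }
    }
}

fn main() {
    let code = exit_code_for(Err("components still running at deadline".into()));
    // A real binary would end with std::process::exit(code); here we just print it.
    println!("{code}");
}
```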


Hmm, I guess the output future is bound to contain a unit type here.


@alexander-jiang what do you think of calling panic! after you collect all the values, instead of as soon as you see the first error? This would let all the components shut down before we engage the panic trap to crash out when we detect a bad shutdown.

Also, we should tag the team so this gets reviewed more widely.


@alexander-jiang alexander-jiang Jan 8, 2025


My understanding is that future::join waits for both futures (source_shutdown_complete and shutdown_complete_future) to complete before returning, so we only panic after both of those futures have completed.

The shutdown_complete_future uses a future::select_all over its futures, which waits until any one of them is ready. In that case, though, I think it is correct not to wait for all of them: the select_all completes either when the deadline is exceeded (the timeout future) or when all the tasks are done (the success future); the reporter future is an infinite loop that writes a log message/event.


Small correction, I think: the select_all handler in https://github.com/discord/vector/pull/8/files#diff-bdafd184965fb9c8ceeb9213965a2d5dbe43037d8f5f3501f3d0fb9d11b4cc8fR197 waits for the first of three events: the timeout, success, or the reporting (currently running) components future returning.

Which makes me think we should also treat the timeout scenario as erroneous, even if some component did not have a reporting loop for completion? 💭 Though, from my reading, this scenario is unlikely to actually manifest.
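To illustrate the select_all semantics under discussion (first-ready wins, remaining futures are dropped), here is a thread-based analogue using only the standard library; the task names and delays are made up for illustration:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Thread-based analogue of the select_all race described above: spawn the
// three "futures" (timeout, success, reporter) and take whichever finishes
// first; the rest are ignored.
fn first_ready(tasks: &[(&'static str, u64)]) -> &'static str {
    let (tx, rx) = mpsc::channel();
    for &(name, delay_ms) in tasks {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(delay_ms));
            let _ = tx.send(name);
        });
    }
    // Only the first completion matters.
    rx.recv().unwrap()
}

fn main() {
    // "success" has the shortest delay, so it wins the race.
    let winner = first_ready(&[("timeout", 200), ("success", 10), ("reporter", 400)]);
    println!("first ready: {winner}");
}
```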


alexander-jiang commented Jan 8, 2025

Testing with an intentionally broken pubsub sink component and then sending a SIGTERM to the Vector process directly (using kill -s SIGTERM {vector_pid} on the pod):

(Note: the panic message below, alexj's panic: failed to gracefully shutdown in time, is different from the panic message in the latest version of the PR; I updated it after running this test.)

{"timestamp":"2025-01-08T19:44:02.730800Z","level":"INFO","message":"Signal received.","signal":"SIGTERM","target":"vector::signal"}
{"timestamp":"2025-01-08T19:44:02.730854Z","level":"INFO","message":"Vector has stopped.","target":"vector"}
[...]
{"timestamp":"2025-01-08T19:44:07.731863Z","level":"INFO","message":"Shutting down... Waiting on running components.","remaining_components":"\"event_collector_sender\"","time_remaining":"\"9 seconds left\"","target":"vector::topology::running"}
{"timestamp":"2025-01-08T19:44:12.731662Z","level":"INFO","message":"Shutting down... Waiting on running components.","remaining_components":"\"event_collector_sender\"","time_remaining":"\"4 seconds left\"","target":"vector::topology::running"}
[...]
thread 'main' panicked at src/topology/running.rs:208:35:
alexj's panic: failed to gracefully shutdown in time
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
{"timestamp":"2025-01-08T19:44:17.731648Z","level":"ERROR","message":"Failed to gracefully shut down in time. Killing components.","components":"\"event_collector_sender\"","target":"vector::topology::running"}
2025-01-08 19:44:17,742 WARN exited: vector (exit status 101; not expected)
vector exited with non-zero code = 101: vector did not shut down gracefully
Removing buffer.lock file
Able to reach GCS bucket vector-stg
Initial buffer directory size = 445999928
starting buffer recovery...
buffer directory size decreased, current size = 311789016
buffer directory size decreased, current size = 177575120
buffer directory size decreased, current size = 43367360
Terminating buffer recovery vector process...
2025-01-08 19:44:39,118 WARN received SIGTERM indicating exit request
2025-01-08 19:44:39,118 INFO waiting for buffer_recovery to die
removing vector socket /tmp/vector.sock
sending sigterm to parent PID 7
2025-01-08 19:44:39,155 INFO exited: buffer_recovery (exit status 0; expected)

@alexander-jiang

Some notes on the vector code that I'm patching here:
The RunningTopology.stop() function (in src/topology/running.rs) is called by TopologyController.stop() (in src/topology/controller.rs), which in turn is called by FinishedApplication.stop(). I believe this last call is what runs when the Vector process receives a shutdown signal (see the os_signals function in src/signal.rs: SIGTERM maps to SignalTo::Shutdown).
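A minimal sketch of that signal-to-shutdown mapping (simplified names and signature; the real enum and signal handling live in src/signal.rs):

```rust
// Simplified sketch of the dispatch described above; the real SignalTo enum
// in src/signal.rs has more variants, and signals arrive via a handler, not
// as strings.
#[derive(Debug, PartialEq)]
enum SignalTo {
    Shutdown,
    Quit,
}

fn map_signal(signal: &str) -> Option<SignalTo> {
    match signal {
        // SIGTERM (and SIGINT) request a graceful shutdown
        "SIGTERM" | "SIGINT" => Some(SignalTo::Shutdown),
        // SIGQUIT requests an immediate quit
        "SIGQUIT" => Some(SignalTo::Quit),
        _ => None,
    }
}

fn main() {
    println!("{:?}", map_signal("SIGTERM").unwrap());
}
```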

@avinashak avinashak self-requested a review January 9, 2025 18:27

@avinashak avinashak left a comment


Requesting a few test scenarios be run before merging this:

  • add two broken sinks and ensure both run until the timeout
  • run the broken-sink scenario multiple times to check for potential cases where the components don't wait for completion
  • run a mix of broken and healthy sink configurations to ensure the shutdown handling covers all components

@alexander-jiang alexander-jiang force-pushed the panic-for-non-graceful-shutdown branch from e9f720f to 2fe231a Compare January 9, 2025 19:50

alexander-jiang commented Jan 14, 2025

Update - I tested again:

  • with the happy path (no broken sink configs), when I sent a SIGTERM to the vector process pid directly (i.e. kill -s SIGTERM {pid}), the exit code was 0 (as expected)
  • with the unhappy path (one broken pubsub sink config: invalid topic name), when I sent a SIGTERM to the vector process pid directly, the exit code was not 0 (it was 101, due to the panic), as expected
  • with the happy path, when I killed the pod (kubectl delete pod {pod_name}), the exit code was unknown (the vector_start.sh script did not get a chance to write the exit code)
  • with the unhappy path, where I added one broken pubsub sink config but kept one healthy pubsub sink config, when I sent the SIGTERM to the vector process pid, the exit code was not 0 (as expected)

@alexander-jiang alexander-jiang merged commit 2a7cc57 into data_infra_vector_stable Jan 14, 2025
44 of 48 checks passed

alexander-jiang commented Jan 14, 2025

I didn't squash the commits correctly when merging, so I did a manual rebase and force-push on the remote data_infra_vector_stable branch.

@alexander-jiang alexander-jiang deleted the panic-for-non-graceful-shutdown branch January 14, 2025 20:38