-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Return non-zero exit code for non graceful shutdown #8
Return non-zero exit code for non graceful shutdown #8
Conversation
src/topology/running.rs
Outdated
// Panic to ensure that Vector returns a non-zero exit code when it's unable to gracefully shutdown | ||
futures::future::join(source_shutdown_complete, shutdown_complete_future) | ||
.map(|futures| match futures.1.0 { | ||
Result::Err(_) => panic!("failed to gracefully shutdown in time, panic to force non-zero exit code"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to avoid a panic!
here without changing the interface. Given this seems to be returning a Future
, can we wrap a Result
in a future instead and unwrap that in the exit handlers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I guess the output future is bounded to contain a unit type here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexander-jiang what do u think of panic!
ing after you collect all the values instead of as soon as you see the first error? this will let all the components shutdown before we engage the panic trap to crash out when we detect a bad shutdown.
also we should tag this to the team for review more widely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is the future::join
will wait for both the futures (source_shutdown_complete
and shutdown_complete_future
) to complete before returning, so we only panic after waiting for both of those futures to complete.
The shutdown_complete_future
uses a future::select_all
over the futures, which waits until any of specified futures are ready, however, in that case I think it is correct to not wait for all - the select_all
will wait until the deadline is exceeded (the timeout
future) or until all the tasks are done (the success
future), the reporter
future is an infinite loop that writes a log message/event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small correction i think - the select_all
handler in https://github.com/discord/vector/pull/8/files#diff-bdafd184965fb9c8ceeb9213965a2d5dbe43037d8f5f3501f3d0fb9d11b4cc8fR197 waits till the first event between - timeout
, success
or the reporting(currently running)
components returns.
which makes me think we should also declare the timeout scenario likely to also be erroneous even if any component was not having a reporting loop for completion? 💭 though this scenario is likely not to manifest from my interpretation
testing with an intentionally broken pubsub sink component and then sending a SIGTERM to the vector process directly (using (note the panic message below
|
Some notes on the vector code that I'm patching here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requesting a few test scenarios be run before merging this:
- add two broken sinks and ensures both run till timeout
- run the broken sinks scenarios multiple times to check if there are potential cases where the components don't wait for completion for completeness
- run both broken and healthy sink configurations to ensure shutdown covers all components
e9f720f
to
2fe231a
Compare
… initialized and then catches up to the writer, it waits for the writer to notify it (but the writer will never write more data in the vector buffer recovery topology/pipeline)
Update - I tested again:
|
2a7cc57
into
data_infra_vector_stable
I didn't squash the commits correctly when merging so I did a manual rebase and force push on the remote data_infra_vector_stable branch |
When testing in staging with a broken pubsub sink (& healthcheck disabled so that vector will start), the vector process returns a exit code of 0, even though it fails to shut down the sink component within the graceful shutdown time limit.
This change will update the behavior so that Vector shutsdown with a non-zero exit code (which we can use later on to determine whether vector was gracefully shutdown or not, so we can determine whether or not to run buffer recovery)
test cases: see PR comment below