ref(consumer): Add batch write timeout to Rust consumer #6043
Conversation
```rust
let client = if let Some(timeout_duration) = batch_write_timeout {
    ClientBuilder::new()
        .default_headers(headers)
        .timeout(timeout_duration)
```
I saw some other options for timeout, e.g. https://docs.rs/reqwest/latest/reqwest/struct.ClientBuilder.html#method.connect_timeout. Might be worth double checking which one we want to use?
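For reference, a minimal sketch (not from this PR) of the two reqwest options being compared here; the durations are made up:

```rust
use std::time::Duration;
use reqwest::ClientBuilder;

// `connect_timeout` only bounds establishing the connection;
// `timeout` bounds the whole request, from connecting until the
// response body has finished.
fn example_clients() -> reqwest::Result<(reqwest::Client, reqwest::Client)> {
    let connect_only = ClientBuilder::new()
        .connect_timeout(Duration::from_secs(5))
        .build()?;
    let whole_request = ClientBuilder::new()
        .timeout(Duration::from_secs(30))
        .build()?;
    Ok((connect_only, whole_request))
}
```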
this one is fine. it applies to the entire lifecycle of the request.
however, I just realized, it also means that the timeout always needs to be higher than the max batch time...
> it also means that the timeout always needs to be higher than the max batch time...
Why is this the case? From https://github.com/getsentry/snuba/blob/master/rust_snuba/src/factory.rs#L130-L149 I thought that the batches are built first (based on either batch time/size) and then packaged as a request to Clickhouse? The place where we are applying the timeout is after the batch has already been built and was closed based on batch time/size, no?
the timeout is for the entire lifecycle of the request, which is from when the batch is being created (=when the http request is being started and starts streaming) to when the batch is flushed (=when the last chunk is being written and we wait for and check the HTTP status code)
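To illustrate the point, here is a hypothetical sketch (not the actual batch.rs code, and it assumes reqwest's `stream` feature): with a streamed request body, the client-wide `timeout` clock starts as soon as the request is sent, while the batch is still being produced, which is why it has to be larger than the max batch time.

```rust
use futures::stream;
use reqwest::{Body, Client};

// Hypothetical streaming insert: the timeout configured on `client` covers
// everything from here until the HTTP response is read, including the time
// spent waiting for further chunks of the batch.
async fn stream_batch(client: &Client, chunks: Vec<Vec<u8>>) -> reqwest::Result<()> {
    let body = Body::wrap_stream(stream::iter(
        chunks.into_iter().map(Ok::<_, std::io::Error>),
    ));
    client
        .post("http://localhost:8123/") // placeholder Clickhouse HTTP endpoint
        .body(body)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```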
Added a constraint in consumer.rs
to account for this. It makes sure that the batch write timeout >= max batch time; if it is not, the batch timeout passed in is silently discarded.
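Roughly, the constraint amounts to something like this (a hypothetical helper, not the exact consumer.rs code):

```rust
use std::time::Duration;

// Only honor the batch write timeout when it is at least as long as the
// max batch time; otherwise drop it, per the constraint described above.
fn effective_batch_write_timeout(
    batch_write_timeout: Option<Duration>,
    max_batch_time: Duration,
) -> Option<Duration> {
    batch_write_timeout.filter(|timeout| *timeout >= max_batch_time)
}
```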
Tested locally by starting up a consumer with
@untitaker If I understand correctly, this would be the desired behavior? Locally, the panic also puts the consumer in a bad state, where if I try to start it again, it crashes again without me sending any data through (but this may just be due to some persistent local state). I modified the unit test as well to reflect that this panic would occur. Also tested that:
it may have some unprocessed messages from the previous crash, as it didn't commit offsets anyway; this behavior is fine. It now doesn't hang anymore, and that's certainly an improvement with regard to backlog size.
Adds a new CLI arg to the Rust consumers, `batch_write_timeout_ms`. This parameter ends up in the batch writer (see `batch.rs`), where the timeout gets applied to the HTTP client that writes data to Clickhouse. The timeout applies to the entire period of the connection plus the time to receive a response from the server.
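As a rough sketch of that flow (hypothetical names; the real code lives in the consumer CLI, factory.rs, and batch.rs):

```rust
use std::time::Duration;
use reqwest::{header::HeaderMap, Client, ClientBuilder};

// Thread an optional `batch_write_timeout_ms` value from the CLI into the
// reqwest client that writes batches to Clickhouse.
fn build_clickhouse_client(
    headers: HeaderMap,
    batch_write_timeout_ms: Option<u64>,
) -> reqwest::Result<Client> {
    let mut builder = ClientBuilder::new().default_headers(headers);
    if let Some(ms) = batch_write_timeout_ms {
        // Covers the connection plus the wait for the server's response.
        builder = builder.timeout(Duration::from_millis(ms));
    }
    builder.build()
}
```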
The reason for adding this timeout is to give us a configurable way to control how long a consumer can stall after encountering transient Clickhouse connection errors. Our current theory is that such errors can affect specific partitions of a consumer, leading to slower-than-ideal recovery and a backlog on those partitions. Introducing a timeout should allow the partitions to recover more quickly and possibly prevent such backlogs from forming in the first place.