
ref(consumer): Add batch write timeout to Rust consumer #6043

Merged: 13 commits merged into master, Jun 25, 2024

Conversation

ayirr7 (Member) commented Jun 19, 2024

Adds a new CLI arg to the Rust consumers, batch_write_timeout_ms. This parameter ends up in the batch writer (see batch.rs), where the timeout is applied to the HTTP client that writes data to Clickhouse.

The timeout applies to the entire request: the connection plus the time to receive a response from the server.

The reason for adding this timeout is to give us a configurable way to bound how long a consumer can stall after encountering transient Clickhouse connection errors. Our current theory is that such errors can affect specific partitions of a consumer, leading to slower-than-ideal recovery and a backlog on those partitions. Introducing a timeout should let the affected partitions recover more quickly and may prevent such backlogs from forming in the first place.
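As a minimal sketch of the plumbing described above (the helper name is illustrative, not from the PR): the CLI surfaces the timeout in milliseconds, and the batch writer wants a `std::time::Duration` to hand to the HTTP client builder.

```rust
use std::time::Duration;

// Hypothetical helper: convert the optional --batch-write-timeout-ms CLI
// value into the Option<Duration> the batch writer would consume.
fn batch_write_timeout_from_ms(ms: Option<u64>) -> Option<Duration> {
    ms.map(Duration::from_millis)
}
```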

@ayirr7 ayirr7 changed the title WIP: add new batch write timeout parameter ref(consumer): Add batch write timeout to Rust consumer Jun 20, 2024
@ayirr7 ayirr7 marked this pull request as ready for review June 21, 2024 02:01
@ayirr7 ayirr7 requested a review from a team as a code owner June 21, 2024 02:01
@ayirr7 ayirr7 requested a review from a team June 21, 2024 02:01
```rust
let client = if let Some(timeout_duration) = batch_write_timeout {
    ClientBuilder::new()
        .default_headers(headers)
        .timeout(timeout_duration)
```
ayirr7 (Member Author) commented:

I saw some other options for timeouts, e.g. https://docs.rs/reqwest/latest/reqwest/struct.ClientBuilder.html#method.connect_timeout. Might be worth double-checking which one we want to use?

untitaker (Member) replied:

This one is fine; it applies to the entire lifecycle of the request.

However, I just realized it also means that the timeout always needs to be higher than the max batch time...

ayirr7 (Member Author) replied Jun 24, 2024:

> it also means that the timeout always needs to be higher than the max batch time...

Why is this the case? From https://github.com/getsentry/snuba/blob/master/rust_snuba/src/factory.rs#L130-L149 I thought that the batches are built first (based on either batch time or size) and then packaged as a request to Clickhouse. The place where we are applying the timeout is after the batch has already been built and closed based on batch time/size, no?

untitaker (Member) replied:

The timeout is for the entire lifecycle of the request, which runs from when the batch is being created (i.e., when the HTTP request is started and begins streaming) to when the batch is flushed (i.e., when the last chunk is written and we wait for and check the HTTP status code).
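Back-of-the-envelope arithmetic for the point above (function name and values are illustrative): if the HTTP request is opened when the batch starts streaming and the timeout covers the whole request, the timeout budget must cover the full batch window plus the final flush and response wait.

```rust
use std::time::Duration;

// Illustrative check: a write timeout is only viable if it exceeds the
// maximum batch accumulation time plus some allowance for the flush
// and the wait for the HTTP status code.
fn timeout_is_viable(
    timeout: Duration,
    max_batch_time: Duration,
    flush_allowance: Duration,
) -> bool {
    timeout >= max_batch_time + flush_allowance
}
```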

ayirr7 (Member Author) replied:

Added a constraint in consumer.rs to account for this. It ensures that the batch write timeout is >= the max batch time; if it is not, the batch write timeout passed in is silently discarded.
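A minimal sketch of the constraint described above (the helper name is hypothetical, not the actual consumer.rs code): the write timeout is kept only if it is at least the max batch time; otherwise it is silently dropped.

```rust
use std::time::Duration;

// Hypothetical helper mirroring the described constraint: keep the
// batch write timeout only when it is >= the max batch time, since the
// request timeout must outlast the batch accumulation window.
fn effective_batch_write_timeout(
    batch_write_timeout: Option<Duration>,
    max_batch_time: Duration,
) -> Option<Duration> {
    batch_write_timeout.filter(|t| *t >= max_batch_time)
}
```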

ayirr7 (Member Author) commented Jun 25, 2024

Tested locally by starting a consumer with --batch-write-timeout-ms=0 and sending a message, which triggered the timeout and caused the consumer to crash with a panic and the following error messages:

```
{"timestamp":"2024-06-25T09:00:14.577750Z","level":"INFO","fields":{"message":"Caught error, terminating strategy"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578528Z","level":"INFO","fields":{"message":"Partitions to revoke: [Partition { topic: Topic(\"snuba-generic-metrics\"), index: 0 }]"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578653Z","level":"INFO","fields":{"message":"Joining strategy"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578771Z","level":"WARN","fields":{"message":"Timeout Some(0ns) reached while waiting for tasks to finish"},"target":"rust_arroyo::processing::strategies::reduce"}
{"timestamp":"2024-06-25T09:00:14.580411Z","level":"INFO","fields":{"message":"Partition revocation complete."},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.849284Z","level":"ERROR","fields":{"message":"Strategy(error sending request for url (http://127.0.0.1:8123/?load_balancing=in_order&insert_distributed_sync=1&query=INSERT+INTO+generic_metric_distributions_raw_local+FORMAT+JSONEachRow): operation timed out\n\nCaused by:\n    operation timed out\n\nStack backtrace:\n   0: std::backtrace_rs::backtrace::libunwind::trace\n             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/ ......
```

@untitaker If I understand correctly, this would be the desired behavior? Locally, the panic also puts the consumer in a bad state: if I try to start it again, it crashes again without me sending any data through (though this may just be due to some persistent local state). I modified the unit test as well to reflect that this panic occurs.

Also tested that:

  • Running consumers without the CLI arg works fine
  • Running consumers with a batch write timeout lower than the configured batch time: the timeout does not take effect, and the consumer operates as it normally does

untitaker (Member) commented Jun 25, 2024

> it crashes again without me sending any data through (but this may just be due to some persistent local state)

It may have some unprocessed messages from the previous crash, since it didn't commit offsets.

Anyway, this behavior is fine. It no longer hangs, and that's certainly an improvement with regard to backlog size.

@ayirr7 ayirr7 merged commit 5b3a54f into master Jun 25, 2024
28 checks passed
@ayirr7 ayirr7 deleted the add-timeout-to-batch-writer branch June 25, 2024 20:23