
ref(consumer): Add batch write timeout to Rust consumer #6043

Merged: 13 commits merged into master, Jun 25, 2024

Conversation

ayirr7 (Member) commented Jun 19, 2024

Adds a new CLI arg to the Rust consumers, batch_write_timeout_ms. This parameter ends up in the batch writer (see batch.rs), where the timeout is applied to the HTTP client that writes data to Clickhouse.

The timeout applies to the entire request: the connection plus the time to receive a response from the server.

The reason for adding this timeout is to give us a configurable way to bound how long a consumer can stall after encountering transient Clickhouse connection errors. Our current theory is that such errors can affect specific partitions of a consumer, leading to slower-than-ideal recovery and a backlog on those partitions. Introducing a timeout should let the affected partitions recover more quickly and may prevent such backlogs from forming in the first place.
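As a minimal sketch of the plumbing described above (the helper name is illustrative, not from the PR): the CLI surfaces the timeout in milliseconds, and the batch writer wants a `std::time::Duration` to hand to the HTTP client builder.

```rust
use std::time::Duration;

// Hypothetical helper: convert the optional --batch-write-timeout-ms CLI
// value into the Option<Duration> the batch writer would consume.
fn batch_write_timeout_from_ms(ms: Option<u64>) -> Option<Duration> {
    ms.map(Duration::from_millis)
}
```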

@ayirr7 ayirr7 changed the title WIP: add new batch write timeout parameter ref(consumer): Add batch write timeout to Rust consumer Jun 20, 2024
@ayirr7 ayirr7 marked this pull request as ready for review June 21, 2024 02:01
@ayirr7 ayirr7 requested a review from a team as a code owner June 21, 2024 02:01
@ayirr7 ayirr7 requested a review from a team June 21, 2024 02:01
```rust
let client = if let Some(timeout_duration) = batch_write_timeout {
    ClientBuilder::new()
        .default_headers(headers)
        .timeout(timeout_duration)
```
ayirr7 (Member Author) commented:

I saw some other options for timeouts, e.g. https://docs.rs/reqwest/latest/reqwest/struct.ClientBuilder.html#method.connect_timeout. Might be worth double-checking which one we want to use?

untitaker (Member) replied:

This one is fine; it applies to the entire lifecycle of the request.

However, I just realized it also means that the timeout always needs to be higher than the max batch time...

ayirr7 (Member Author) replied Jun 24, 2024:

> it also means that the timeout always needs to be higher than the max batch time...

Why is this the case? From https://github.com/getsentry/snuba/blob/master/rust_snuba/src/factory.rs#L130-L149 I thought that the batches are built first (based on either batch time or size) and then packaged as a request to Clickhouse. The place where we are applying the timeout is after the batch has already been built and closed based on batch time/size, no?

untitaker (Member) replied:

The timeout is for the entire lifecycle of the request, which runs from when the batch is being created (i.e., when the HTTP request is started and begins streaming) to when the batch is flushed (i.e., when the last chunk is written and we wait for and check the HTTP status code).
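Back-of-the-envelope arithmetic for the point above (function name and values are illustrative): if the HTTP request is opened when the batch starts streaming and the timeout covers the whole request, the timeout budget must cover the full batch window plus the final flush and response wait.

```rust
use std::time::Duration;

// Illustrative check: a write timeout is only viable if it exceeds the
// maximum batch accumulation time plus some allowance for the flush
// and the wait for the HTTP status code.
fn timeout_is_viable(
    timeout: Duration,
    max_batch_time: Duration,
    flush_allowance: Duration,
) -> bool {
    timeout >= max_batch_time + flush_allowance
}
```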

ayirr7 (Member Author) replied:

Added a constraint in consumer.rs to account for this. It ensures that the batch write timeout is >= the max batch time; if it is not, the batch write timeout passed in is silently discarded.
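A minimal sketch of the constraint described above (the helper name is hypothetical, not the actual consumer.rs code): the write timeout is kept only if it is at least the max batch time; otherwise it is silently dropped.

```rust
use std::time::Duration;

// Hypothetical helper mirroring the described constraint: keep the
// batch write timeout only when it is >= the max batch time, since the
// request timeout must outlast the batch accumulation window.
fn effective_batch_write_timeout(
    batch_write_timeout: Option<Duration>,
    max_batch_time: Duration,
) -> Option<Duration> {
    batch_write_timeout.filter(|t| *t >= max_batch_time)
}
```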

ayirr7 (Member Author) commented Jun 25, 2024

Tested locally by starting a consumer with --batch-write-timeout-ms=0 and sending a message, which triggered the timeout and caused the consumer to crash with a panic and the following error messages:

```
{"timestamp":"2024-06-25T09:00:14.577750Z","level":"INFO","fields":{"message":"Caught error, terminating strategy"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578528Z","level":"INFO","fields":{"message":"Partitions to revoke: [Partition { topic: Topic(\"snuba-generic-metrics\"), index: 0 }]"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578653Z","level":"INFO","fields":{"message":"Joining strategy"},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.578771Z","level":"WARN","fields":{"message":"Timeout Some(0ns) reached while waiting for tasks to finish"},"target":"rust_arroyo::processing::strategies::reduce"}
{"timestamp":"2024-06-25T09:00:14.580411Z","level":"INFO","fields":{"message":"Partition revocation complete."},"target":"rust_arroyo::processing"}
{"timestamp":"2024-06-25T09:00:14.849284Z","level":"ERROR","fields":{"message":"Strategy(error sending request for url (http://127.0.0.1:8123/?load_balancing=in_order&insert_distributed_sync=1&query=INSERT+INTO+generic_metric_distributions_raw_local+FORMAT+JSONEachRow): operation timed out\n\nCaused by:\n    operation timed out\n\nStack backtrace:\n   0: std::backtrace_rs::backtrace::libunwind::trace\n             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/ ......
```

@untitaker If I understand correctly, this would be the desired behavior? Locally, the panic also puts the consumer in a bad state: if I try to start it again, it crashes again without me sending any data through (though this may just be due to some persistent local state). I modified the unit test as well to reflect that this panic occurs.

Also tested that:

  • Running consumers without the CLI arg works fine
  • Running consumers with a batch write timeout lower than the configured batch time: the timeout does not take effect, and the consumer operates as it normally does

untitaker (Member) commented Jun 25, 2024

> it crashes again without me sending any data through (but this may just be due to some persistent local state)

It may have some unprocessed messages from the previous crash, since it didn't commit offsets.

Anyway, this behavior is fine. It no longer hangs, and that's certainly an improvement with regard to backlog size.

@ayirr7 ayirr7 merged commit 5b3a54f into master Jun 25, 2024
28 checks passed
@ayirr7 ayirr7 deleted the add-timeout-to-batch-writer branch June 25, 2024 20:23