system.log -> messages.log scraping can start lagging badly after connection is lost #9512

piodul · 2024-12-09T11:32:59Z

Argus run that prompted this issue: https://argus.scylladb.com/tests/scylla-cluster-tests/83efb5fa-0232-4b55-a1e1-219764056cee. I have left a comment in the discussion there which explains the problem in more detail. Summarizing:

An event happens which causes Scylla to generate a relatively large amount of logs (it was nodetool rebuild in this run),
The system.log -> messages.log scraping mechanism (I think it's called syslog-ng) loses connection to the node, reconnects after 60s but keeps lagging badly afterwards; I saw that one line was delayed by 16 minutes in that particular run,
The nemesis code which waits for a log line to appear times out, even though it appeared in system.log pretty quickly, leading to test flakiness.
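
To illustrate the third point, here is a minimal sketch of a "wait for a log line with a timeout" helper that polls a log file directly instead of relying on a forwarded copy. It is not the actual nemesis code: the function name `wait_for_log_line` and its parameters are hypothetical, and it assumes the log is readable as a local file, which would not hold as-is for a log living on a remote node.

```python
import re
import time


def wait_for_log_line(path, pattern, timeout=60.0, poll_interval=0.5, start_offset=0):
    """Poll `path` until a complete line matching `pattern` appears.

    Hypothetical helper for illustration only. Returns the matching line
    (without the trailing newline), or None if `timeout` seconds pass first.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    offset = start_offset  # byte offset of the first unread line

    while True:
        try:
            with open(path, "rb") as f:
                f.seek(offset)
                while True:
                    raw = f.readline()
                    # An empty or unterminated read means we reached EOF or a
                    # line that is still being written; retry it next poll.
                    if not raw.endswith(b"\n"):
                        break
                    offset = f.tell()
                    line = raw.decode("utf-8", errors="replace").rstrip("\n")
                    if regex.search(line):
                        return line
        except FileNotFoundError:
            pass  # the log may not exist yet; keep polling until the deadline

        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

With a helper like this, the timeout only has to cover the time it takes for the line to reach the file being read, so any delay added by a forwarding stage in between does not count against it. Passing a `start_offset` recorded before triggering the operation avoids matching older occurrences of the same line.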


fruch · 2024-12-09T23:45:29Z

I would argue it's a Scylla bug if it gets to more than 100K per sec of logs on one node.

Also, can you suggest a different way to identify that a rebuild has started, without looking at the logs?

roydahan · 2024-12-10T11:20:16Z

What are these logs (the sudden burst) about?

github-actions bot assigned piodul Dec 9, 2024

piodul removed their assignment Dec 9, 2024
