-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Akka Persistence Cassandra retry after DB exceptions #847
Comments
From the Amazon Keyspaces contact, Michael Raney. There are two thresholds. 1 for the table and 1 for the individual partitions that make up the partition. If access is too hot on a given partition then it may result in hotkey. Maybe increase the wait between polling slightly? Few mills." This appears to be what is happening. The DB is throwing exceptions and the queries retry often, which keeps the DB in an overloaded state. |
Thanks for reporting. We'll look into this shortly. |
The reties are configurable: https://github.com/akka/akka-persistence-cassandra/blob/master/core/src/main/resources/reference.conf#L357 but maybe there is something broken there. Can you provide all the query logs? To verify the queries are not backing off to every 30s? Each time there is a successful result the backoff is reset |
The retries in eventsByTag works as expected. Exponential backoff according to the configuration, so I don't think that is causing the db overload. The logging of the failed attempts were missing, though, so added in #848 Above log mentions "Backtrack failed, this will retried". I will look into if we retry those queries without delays. |
Shall we leave this open for @mckeeh3 to try? Can you try publishing locally Hugh? |
Running tests with Akka Micorservices in Amazon EKS and Keyspaces for Cassandra. When exceptions are thrown due to DB operation timeouts the below queries are retried every few seconds. With version 1.0.3 under the same conditions when these same exceptions occur circuit breakers opened and the retry rate slowed down. Testing indicated that the 1.0.3 behavior recovered more quickly than 1.0.4.
Versions used
Akka version: akka persistence Cassandra 1.0.4
Expected Behavior
When exceptions like this occur expect some circuit breaker reaction with backoff before retry.
Actual Behavior
The below query is performed with no apparent backoff. Queries are retried every few seconds.
Relevant logs
woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-4] - [s0|1919515490][Node(endPoint=3.234.248.222/3.234.248.222:9142, hostId=fd0e98e7-ad76-310d-9094-065cea862868, hashCode=18dbebab)] Error (2 ms) [4 values]
woe-twin-749df6f759-q9kpn woe-twin SELECT persistence_id, tag_pid_sequence_nr, timestamp
woe-twin-749df6f759-q9kpn woe-twin FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-q9kpn woe-twin tag_name = ? AND
woe-twin-749df6f759-q9kpn woe-twin timebucket = ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp > ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp <= ? [tag_name='0', timebucket=1606838400000, timestamp=2008b020-33f0-11eb-8080-808080808080, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b] MDC: {}
woe-twin-749df6f759-q9kpn woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)
woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [WARN] [akka.persistence.cassandra.query.EventsByTagStage] [] [woe-twin-akka.actor.default-dispatcher-5] - Backtrack failed, this will retried. com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded) MDC: {akkaAddress=akka://[email protected]:25520, sourceThread=woe-twin-akka.actor.default-dispatcher-5, akkaSource=EventsByTagStage(akka://woe-twin), sourceActorSystem=woe-twin, akkaTimestamp=16:28:40.103UTC}
woe-twin-749df6f759-77xqb woe-twin [2020-12-01 16:28:40,143] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-3] - [s0|1245995343][Node(endPoint=3.234.248.223/3.234.248.223:9142, hostId=8f3b894b-760a-38c2-9db9-bcdc6cf66ef2, hashCode=5192550a)] Error (3 ms) [4 values]
woe-twin-749df6f759-77xqb woe-twin SELECT * FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-77xqb woe-twin tag_name = ? AND
woe-twin-749df6f759-77xqb woe-twin timebucket = ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp > ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp < ?
woe-twin-749df6f759-77xqb woe-twin ORDER BY timestamp ASC
woe-twin-749df6f759-77xqb woe-twin [tag_name='0', timebucket=1606838400000, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b, timestamp=4194b8df-33f2-11eb-7f7f-7f7f7f7f7f7f] MDC: {}
woe-twin-749df6f759-77xqb woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)
Reproducible Test Case
The problem occurs when Cassandra is pushed to the point that some operations are timing out.
The text was updated successfully, but these errors were encountered: