Akka Persistence Cassandra retry after DB exceptions #847

mckeeh3 · 2020-12-01T21:34:47Z

Running tests with Akka Micorservices in Amazon EKS and Keyspaces for Cassandra. When exceptions are thrown due to DB operation timeouts the below queries are retried every few seconds. With version 1.0.3 under the same conditions when these same exceptions occur circuit breakers opened and the retry rate slowed down. Testing indicated that the 1.0.3 behavior recovered more quickly than 1.0.4.

Versions used

Akka version: akka persistence Cassandra 1.0.4

Expected Behavior

When exceptions like this occur expect some circuit breaker reaction with backoff before retry.

Actual Behavior

The below query is performed with no apparent backoff. Queries are retried every few seconds.

Relevant logs

woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-4] - [s0|1919515490][Node(endPoint=3.234.248.222/3.234.248.222:9142, hostId=fd0e98e7-ad76-310d-9094-065cea862868, hashCode=18dbebab)] Error (2 ms) [4 values]
woe-twin-749df6f759-q9kpn woe-twin SELECT persistence_id, tag_pid_sequence_nr, timestamp
woe-twin-749df6f759-q9kpn woe-twin FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-q9kpn woe-twin tag_name = ? AND
woe-twin-749df6f759-q9kpn woe-twin timebucket = ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp > ? AND
woe-twin-749df6f759-q9kpn woe-twin timestamp <= ? [tag_name='0', timebucket=1606838400000, timestamp=2008b020-33f0-11eb-8080-808080808080, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b] MDC: {}
woe-twin-749df6f759-q9kpn woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)
woe-twin-749df6f759-q9kpn woe-twin [2020-12-01 16:28:40,103] [WARN] [akka.persistence.cassandra.query.EventsByTagStage] [] [woe-twin-akka.actor.default-dispatcher-5] - Backtrack failed, this will retried. com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded) MDC: {akkaAddress=akka://[email protected]:25520, sourceThread=woe-twin-akka.actor.default-dispatcher-5, akkaSource=EventsByTagStage(akka://woe-twin), sourceActorSystem=woe-twin, akkaTimestamp=16:28:40.103UTC}
woe-twin-749df6f759-77xqb woe-twin [2020-12-01 16:28:40,143] [ERROR] [com.datastax.oss.driver.internal.core.tracker.RequestLogger] [] [s0-io-3] - [s0|1245995343][Node(endPoint=3.234.248.223/3.234.248.223:9142, hostId=8f3b894b-760a-38c2-9db9-bcdc6cf66ef2, hashCode=5192550a)] Error (3 ms) [4 values]
woe-twin-749df6f759-77xqb woe-twin SELECT * FROM woe_twin.tag_views WHERE
woe-twin-749df6f759-77xqb woe-twin tag_name = ? AND
woe-twin-749df6f759-77xqb woe-twin timebucket = ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp > ? AND
woe-twin-749df6f759-77xqb woe-twin timestamp < ?
woe-twin-749df6f759-77xqb woe-twin ORDER BY timestamp ASC
woe-twin-749df6f759-77xqb woe-twin [tag_name='0', timebucket=1606838400000, timestamp=2303a0a0-33f0-11eb-ad7d-478fe0f9131b, timestamp=4194b8df-33f2-11eb-7f7f-7f7f7f7f7f7f] MDC: {}
woe-twin-749df6f759-77xqb woe-twin com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 0 replica responded)

Reproducible Test Case

The problem occurs when Cassandra is pushed to the point that some operations are timing out.

mckeeh3 · 2020-12-02T12:19:37Z

From the Amazon Keyspaces contact, Michael Raney.
"If you made a request that was rejected for a hotkey that’s not counted towards the limit itself. It’s sort of blocked at the front door.

There are two thresholds. 1 for the table and 1 for the individual partitions that make up the partition. If access is too hot on a given partition then it may result in hotkey.

Maybe increase the wait between polling slightly? Few mills."

This appears to be what is happening. The DB is throwing exceptions and the queries retry often, which keeps the DB in an overloaded state.

patriknw · 2020-12-02T17:44:55Z

Thanks for reporting. We'll look into this shortly.

chbatey · 2020-12-07T10:51:53Z

The reties are configurable: https://github.com/akka/akka-persistence-cassandra/blob/master/core/src/main/resources/reference.conf#L357 but maybe there is something broken there.

Can you provide all the query logs? To verify the queries are not backing off to every 30s? Each time there is a successful result the backoff is reset

patriknw · 2020-12-07T12:07:30Z

The retries in eventsByTag works as expected. Exponential backoff according to the configuration, so I don't think that is causing the db overload. The logging of the failed attempts were missing, though, so added in #848

Above log mentions "Backtrack failed, this will retried". I will look into if we retry those queries without delays.

chbatey · 2020-12-07T16:14:32Z

Shall we leave this open for @mckeeh3 to try? Can you try publishing locally Hugh?

patriknw added the 2 - pick next Used to mark issues which are next up in the queue to be worked on. The tag is non-binding label Dec 2, 2020

patriknw self-assigned this Dec 7, 2020

patriknw mentioned this issue Dec 7, 2020

logging of retries eventsByTag #848

Merged

patriknw added a commit that referenced this issue Dec 7, 2020

make scan less aggressive, #847

4e83083

patriknw mentioned this issue Dec 7, 2020

make scan less aggressive, #847 #849

Merged

chbatey pushed a commit that referenced this issue Dec 7, 2020

make scan less aggressive, #847 (#849)

137d16f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Akka Persistence Cassandra retry after DB exceptions #847

Akka Persistence Cassandra retry after DB exceptions #847

mckeeh3 commented Dec 1, 2020

mckeeh3 commented Dec 2, 2020

patriknw commented Dec 2, 2020

chbatey commented Dec 7, 2020

patriknw commented Dec 7, 2020

chbatey commented Dec 7, 2020

Akka Persistence Cassandra retry after DB exceptions #847

Akka Persistence Cassandra retry after DB exceptions #847

Comments

mckeeh3 commented Dec 1, 2020

Versions used

Expected Behavior

Actual Behavior

Relevant logs

Reproducible Test Case

mckeeh3 commented Dec 2, 2020

patriknw commented Dec 2, 2020

chbatey commented Dec 7, 2020

patriknw commented Dec 7, 2020

chbatey commented Dec 7, 2020