SuperStream doesn't elect the single active consumer #7743
A group of consumers on a super stream can end up blocked without an active consumer. This can happen with consumer churn: a consumer is removed, which makes the active consumer passive, but that former active consumer never learns about it because it is the one that has been removed. This commit changes the structure of the messages the SAC coordinator sends to consumer connections so that they embed enough information to look up the group and to instruct the connection to choose a new active consumer when the race condition above comes up. Because of the change in message structure, a feature flag is required to make sure the SAC coordinator starts sending the new messages only once all the nodes have been upgraded. References #7743
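Since the fix above is gated behind a feature flag, the new behaviour only takes effect once every node runs the new version and the flag is enabled. A minimal sketch of checking and enabling feature flags from the CLI; the commit does not quote the exact flag name, so `<feature_flag_name>` below is a placeholder:

```shell
# List feature flags and their state (enabled/disabled) across the cluster
rabbitmqctl list_feature_flags

# Enable a specific flag once all nodes have been upgraded
# (replace <feature_flag_name> with the stream SAC flag introduced by the commit above)
rabbitmqctl enable_feature_flag <feature_flag_name>

# Or enable every feature flag supported by all nodes
rabbitmqctl enable_feature_flag all
```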
The stream plugin can send frames to a client connection and expect a response from it. This is currently used for the consumer_update frame (single active consumer feature). There was no timeout mechanism so far, so a slow or blocked application could prevent a group of consumers from moving on. This commit introduces a timeout mechanism: if the expected response takes too long to arrive, the server assumes the connection is blocked and closes it. The default timeout is 60 seconds, but it can be changed by setting the request_timeout parameter of the rabbitmq_stream application. Note the mechanism does not enforce the exact duration of the timeout, as a timer is set for the first request and reused for subsequent requests. With bad timing, a request can time out after up to twice the configured timeout. References #7743
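Because `request_timeout` is an application environment parameter of the `rabbitmq_stream` plugin, one way to change it is through `advanced.config`. A minimal sketch, assuming the value is expressed in milliseconds (60000 corresponding to the stated 60-second default); the 120000 value is only illustrative:

```erlang
%% advanced.config: override the consumer_update request timeout of the
%% stream plugin (value assumed to be in milliseconds; 60000 = 60 s default).
[
  {rabbitmq_stream, [
    {request_timeout, 120000}  %% give slow consumers up to 2 minutes to reply
  ]}
].
```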
They can be useful and are not on hot paths, but they are replicated on all nodes as part of the state machine replication, so we are better off removing them to avoid noise. References #7743
@fabiorosa-sn have you enabled
Yes, it's enabled. We are using the Java stream client version 0.15.0.
Here is the result of the command, as requested by @Gsantomaggio:
@Gsantomaggio This issue still exists, we are on
We are able to reproduce the issue with 5 active consumers and restarting all of them a couple of times. Looking at the screenshots, you can see that after some restarts one stream has no active consumer, with all consumers in the waiting state. Can you give some more hints on how to diagnose this further? Thanks!
@msvechla the most helpful step towards solving this problem would be to provide the exact steps for reproducing the issue (even if one of the steps is non-deterministic, like "restart a few times"). Ideally you could use https://github.com/rabbitmq/rabbitmq-stream-perf-test so that we can easily follow the steps, or you can use your own code, provided it is in a complete, executable format (a GitHub repo we can clone and build). Please be precise: start with a fresh RabbitMQ, declare the super stream (show us how you do it), then run stream-perf-test (or your app), provide the exact command, and so on. If you can put together such a test case, please open a new issue with all the details.
@mkuratczyk thanks. As we are currently not running this on our own infrastructure but on CloudAMQP, we will also check with their support. Thanks for providing the instructions; we will work on such a reproducible setup if we cannot find the root cause there. Just for my understanding: the scenario shown in my screenshots above, as well as in others in this issue, where there are multiple connected consumers but all of them are in the waiting state, should never happen, right? Why is the RabbitMQ server not determining that there is no active consumer and attempting to elect a new one? Should we not see this somewhere in the RabbitMQ server logs?
Yes, it should pick one of the available consumers.
Well, that's what we need to understand, and reproduction steps would help greatly. :)
Hi, I can confirm that the issue still exists, running 3.13.7. This happened after a node restart (in a cluster of 3 RabbitMQ nodes). I don't know if it matters, but our clusters are relatively big, with around 1000 super streams, so I guess there is a lot of SAC switching happening when we restart a RabbitMQ node.
Can confirm, the issue is still reproducible on 3.13.7.
@dukhaSlayer There were several stream SAC changes considered or adopted for
Time to upgrade.
Thank you, @Gsantomaggio. That's indeed relevant, just hasn't been finished or shipped yet :)
Describe the bug
The super stream doesn't elect the single active consumer when the consumers are restarted.

The consumers stop consuming and their status is always `waiting`; see the image above.

Reproduction steps
```shell
rabbitmq-streams add_super_stream invoices --partitions 10
```
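The consumers involved register against the super stream with single active consumer enabled and a shared group name, and the problem shows up after starting several of them and restarting them. A minimal sketch of that kind of consumer with the Java stream client; the group name `invoices-app` and the broker URI are illustrative, not taken from the report:

```java
import com.rabbitmq.stream.Consumer;
import com.rabbitmq.stream.Environment;

public class SuperStreamSacConsumer {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's stream port (adjust host/port as needed)
        Environment environment = Environment.builder()
                .uri("rabbitmq-stream://localhost:5552")
                .build();

        // Single active consumer on the "invoices" super stream:
        // all consumers sharing the same name form one group, and the broker
        // is expected to keep exactly one of them active per partition.
        Consumer consumer = environment.consumerBuilder()
                .superStream("invoices")
                .singleActiveConsumer()
                .name("invoices-app")
                .messageHandler((context, message) ->
                        System.out.println("Received from " + context.stream()))
                .build();

        // Keep the process alive while consuming; press Enter to exit
        System.in.read();
        consumer.close();
        environment.close();
    }
}
```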
Expected behavior
A single active consumer should be elected for each partition of the super stream.
Additional context
I noticed that the `invoices-1` partition is usually the first to have problems. The `invoices-0` partition usually works. The other partitions, at some point, will have the same issue.
RabbitMQ 3.11.11
RabbitMQ Stream Java client, version 0.10.0-SNAPSHOT