SuperStream doesn't elect the single active consumer #7743


Closed
Gsantomaggio opened this issue Mar 27, 2023 · 13 comments

Comments

@Gsantomaggio
Member

Gsantomaggio commented Mar 27, 2023

Describe the bug

The super stream doesn't elect a single active consumer when the consumers are restarted.
The consumers stop consuming and remain stuck in the waiting status; see the screenshot:
Screenshot 2023-03-27 at 10 42 15

Reproduction steps

  1. Create a super stream with ten partitions: rabbitmq-streams add_super_stream invoices --partitions 10
  2. Pump it with a few tens of thousands of messages (70k is enough); a minimal producer sketch is shown after the steps
  3. Stop the producer
  4. Start ten instances of this Java client:
    // Imports for this snippet (place it inside any class); the stream classes come from
    // the RabbitMQ Stream Java client.
    import com.rabbitmq.stream.Address;
    import com.rabbitmq.stream.Consumer;
    import com.rabbitmq.stream.Environment;
    import com.rabbitmq.stream.OffsetSpecification;

    import java.io.IOException;
    import java.util.Date;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;

    public static void main(String[] args) throws IOException {

        System.out.println("Connecting...");
        Address entryPoint = new Address("127.0.0.1", 5552);

        // one consumer per connection, so each partition consumer ends up on its own connection
        Environment environment = Environment.builder()
//                .host(entryPoint.host())
//                .port(entryPoint.port())
//                .username("test")
//                .password("test")
                .addressResolver(address -> entryPoint)
                .maxConsumersByConnection(1)
                .build();
        String appName = "reference";
        String stream = "invoices";

        Date start = new Date();
        for (int i = 0; i < 500; i++) {

            Map<String, Integer> consumedMap = new HashMap<>();
            // named single active consumer on the super stream
            Consumer consumer = environment.consumerBuilder()
                    .superStream(stream)
                    .name(appName)
                    .offset(OffsetSpecification.first())
                    .singleActiveConsumer()
                    .messageHandler((context, message) -> {
                        // count consumed messages and log every 10th one
                        if (consumedMap.containsKey(stream)) {
                            consumedMap.put(stream, consumedMap.get(stream) + 1);
                        } else {
                            consumedMap.put(stream, 1);
                        }

                        if (consumedMap.get(stream) % 10 == 0) {
                            Date end = new Date();
                            System.out.println("Stream: " + context.stream() + " - Consumed " + consumedMap.get(stream) + " - Time " + (end.getTime() - start.getTime()));
                        }

                        // simulate slow processing
                        try {
                            Thread.sleep(ThreadLocalRandom.current().nextInt(200, 1000));
                        } catch (InterruptedException e) {
                            throw new RuntimeException(e);
                        }
                    }).build();

            // let the consumer run for a minute, then close it to force a restart
            try {
                Thread.sleep(60000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println("Restarting");

            consumer.close();
        }

    }
  5. Wait a couple of restarts
  6. You will have the issue
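For step 2, the messages can be pumped with something along the lines of the sketch below. This is an illustrative example, not code from the original report: it assumes the same RabbitMQ Stream Java client and the Environment built in the consumer snippet above, and routes messages across partitions by message ID.

    // Hypothetical producer used only to fill the "invoices" super stream for the repro.
    // Additional imports needed: com.rabbitmq.stream.Producer, com.rabbitmq.stream.Message,
    // java.nio.charset.StandardCharsets.
    Producer producer = environment.producerBuilder()
            .superStream("invoices")
            // pick the target partition from the message ID
            .routing(message -> message.getProperties().getMessageIdAsString())
            .producerBuilder()
            .build();

    for (int i = 0; i < 70_000; i++) {
        Message message = producer.messageBuilder()
                .properties().messageId(String.valueOf(i)).messageBuilder()
                .addData(("invoice-" + i).getBytes(StandardCharsets.UTF_8))
                .build();
        producer.send(message, confirmationStatus -> { /* confirms ignored in this sketch */ });
    }
    producer.close();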

Expected behavior

Exactly one consumer of the group should be elected and actively consuming on each partition.

Additional context

I noticed that invoices-1 is usually the first partition to have problems.
The invoices-0 partition usually keeps working.
The other partitions, at some point, show the same issue.

RabbitMQ 3.11.11
RabbitMQ Stream Java client 0.10.0-SNAPSHOT

acogoluegnes added a commit that referenced this issue Mar 30, 2023
A group of consumers on a super stream can end up blocked
without an active consumer. This can happen with consumer
churn: one consumer gets removed, which makes the active
consumer passive, but the former active consumer never
gets to know because it has been removed itself.

This commit changes the structure of the messages the SAC
coordinator sends to consumer connections, to embed enough
information to look up the group and to instruct it to choose
a new active consumer when the race condition mentioned above
comes up.

Because of the changes in the structure of messages, a feature
flag is required to make sure the SAC coordinator starts
sending the new messages only when all the nodes have been upgraded.

References #7743
acogoluegnes added a commit that referenced this issue Mar 30, 2023
acogoluegnes added a commit that referenced this issue Mar 31, 2023
acogoluegnes added a commit that referenced this issue Mar 31, 2023
@michaelklishin michaelklishin added this to the 3.11.14 milestone Mar 31, 2023
acogoluegnes added a commit that referenced this issue Apr 3, 2023
The stream plugin can send frames to a client connection
and expect a response from it. This is used currently
for the consumer_update frame (single active consumer feature).
There was no timeout mechanism so far, so a slow or blocked
application could prevent a group of consumers from moving on.

This commit introduces a timeout mechanism: if the expected
response takes too long to arrive, the server assumes the
connection is blocked and closes it.

The default timeout is 60 seconds but it can be changed by setting
the request_timeout parameter of the rabbitmq_stream application.

Note the mechanism does not enforce the exact duration of the timeout,
as a timer is set for the first request and re-used for other requests.
With bad timing, a request can time out after up to twice
the configured timeout.

References #7743
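As a side note on the client side of this exchange: the consumer_update frame mentioned in the commit above is answered by the client library, and the Java stream client lets the application hook into it with a consumer update listener to decide where to resume when it becomes the active consumer. The sketch below is illustrative only (it is not from this issue and assumes the rabbitmq-stream-java-client API with application-side offset tracking); a handler that blocks here is exactly the kind of slow response the new timeout guards against.

    // Hypothetical sketch of client-side consumer_update handling
    // (assumes an existing Environment; offsets are tracked by the application).
    Consumer consumer = environment.consumerBuilder()
            .superStream("invoices")
            .name("reference")
            .singleActiveConsumer()
            .noTrackingStrategy()
            .consumerUpdateListener(context -> {
                // Called when the broker promotes or demotes this consumer; the returned
                // offset is used when it becomes the active one. Return promptly: the
                // broker now closes connections that do not answer in time.
                long nextOffset = 0; // a real application would read this from its own offset store
                return OffsetSpecification.offset(nextOffset);
            })
            .messageHandler((context, message) -> {
                // process the message and record context.offset() somewhere durable
            })
            .build();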
acogoluegnes added a commit that referenced this issue Apr 3, 2023
michaelklishin pushed a commit that referenced this issue Apr 4, 2023
michaelklishin pushed a commit that referenced this issue Apr 4, 2023
michaelklishin pushed a commit that referenced this issue Apr 4, 2023
michaelklishin pushed a commit that referenced this issue Apr 4, 2023
mergify bot pushed a commit that referenced this issue Apr 4, 2023
mergify bot pushed a commit that referenced this issue Apr 4, 2023
References #7743

(cherry picked from commit f20f415)
mergify bot pushed a commit that referenced this issue Apr 4, 2023
mergify bot pushed a commit that referenced this issue Apr 4, 2023
References #7743

(cherry picked from commit 4a669e1)
mergify bot pushed a commit that referenced this issue Apr 4, 2023
mergify bot pushed a commit that referenced this issue Apr 4, 2023
References #7743

(cherry picked from commit f20f415)
(cherry picked from commit 0461e0a)
mergify bot pushed a commit that referenced this issue Apr 4, 2023
mergify bot pushed a commit that referenced this issue Apr 4, 2023
References #7743

(cherry picked from commit 4a669e1)
(cherry picked from commit d9739b2)
acogoluegnes added a commit that referenced this issue Apr 5, 2023
References #7743
acogoluegnes added a commit that referenced this issue Apr 5, 2023
They can be useful and are not on hot paths, but
they are replicated on all nodes as part of the state
machine replication, so we are better off removing them
to avoid noise.

References #7743
mergify bot pushed a commit that referenced this issue Apr 5, 2023
acogoluegnes added a commit to rabbitmq/rabbitmq-stream-java-client that referenced this issue Apr 5, 2023
github-actions bot pushed a commit to rabbitmq/rabbitmq-stream-java-client that referenced this issue Apr 5, 2023
mergify bot pushed a commit that referenced this issue Apr 5, 2023
@michaelklishin michaelklishin modified the milestones: 3.11.14, 3.12.0 Apr 5, 2023
@fabiorosa-sn

We are currently facing the same problem with version 3.12.10, Erlang 25.3.2.7.

It has happened twice in the ~3 months we have been running with SuperStreams.

The only way for us to "fix it" is to scale the consumer application down to 0 and then scale it up again.

screenshot-1
screenshot-2

@Gsantomaggio
Member Author

@fabiorosa-sn Have you enabled the stream_sac_coordinator_unblock_group feature flag? Which stream client are you using?

@fabiorosa-sn

fabiorosa-sn commented Jun 5, 2024

Yes, it's enabled.

We are using the Java stream client version 0.15.0.

@fabiorosa-sn

fabiorosa-sn commented Jun 5, 2024

Here is the result of the command

rabbitmqctl eval 'rabbit_stream_coordinator:sac_state(rabbit_stream_coordinator:state()).'

as requested by @Gsantomaggio

stream_coordinator_node-01.txt

@msvechla

msvechla commented Jan 9, 2025

@Gsantomaggio This issue still exists; we are on RabbitMQ 3.12.14, Erlang 26.2.5.5.

We are able to reproduce the issue with 5 active consumers by restarting all of them a couple of times.

CleanShot 2025-01-09 at 11 55 46
CleanShot 2025-01-09 at 12 13 10

Looking at the screenshots, you can see that after some restarts one stream does not have any active consumer, with all consumers in the waiting state.

Can you give some more hints on how to diagnose this further? Thanks!

@mkuratczyk
Contributor

@msvechla the most helpful step towards solving this problem would be to provide the exact steps for reproducing the issue (even if one of the steps is non-deterministic, like "restart a few times"). Ideally you could use https://github.com/rabbitmq/rabbitmq-stream-perf-test so that we can easily follow the steps; you can also use your own code, as long as you provide it in a complete, executable format (a GH repo we can clone and build).

Please be precise: start with a fresh RabbitMQ, declare the super stream (show us how you do it), then run stream-perf-test (or your app), provide the exact command, etc.

If you can put together such a test case, please open a new issue with all the details.

@msvechla

msvechla commented Jan 9, 2025

@mkuratczyk thanks. As we are not running this on our own infrastructure but on CloudAMQP, we will also check with their support.

Thanks for providing the instructions; we will work on such a reproducible setup in case we cannot find the root cause there.

Just for my understanding: the scenario shown in my screenshots above, as well as in others in this issue, where there are multiple connected consumers but all of them are in the waiting state, should never happen, right?

Why is the RabbitMQ server not determining that there is no active consumer and attempting to elect a new one? Should we not see this in the RabbitMQ server logs somewhere?

@mkuratczyk
Contributor

Just for my understanding: the scenario shown in my screenshots above, as well as in others in this issue, where there are multiple connected consumers but all of them are in the waiting state, should never happen, right?

Yes, it should pick one of the available consumers.

Why is the RabbitMQ server not determining that there is no active consumer and attempting to elect a new one?

Well, that's what we need to understand, and reproduction steps would help greatly. :)

@jonnepmyra

Hi, I can confirm that the issue still exists, running 3.13.7. This happened after a node restart (in a cluster of 3 RabbitMQ nodes).

Don't know if it matters, but our clusters are relatively big, with around 1000 super streams, so I guess there's a lot of SAC switching happening when we restart a RabbitMQ node.

Image

@dukhaSlayer

Can confirm, the issue is still reproducible on 3.13.7.

@michaelklishin
Collaborator

michaelklishin commented May 1, 2025

@dukhaSlayer 3.13.x and 4.0.x are out of community support.

There were several stream SAC changes considered or adopted for 4.1.0, such as #13657 (merged) and #13671 (rejected).

Time to upgrade.

@rabbitmq rabbitmq locked and limited conversation to collaborators May 1, 2025
@Gsantomaggio
Member Author

There were several stream SAC changes considered or adopted for 4.1.0, such as #13657 (merged) and #13671 (rejected).

Please also follow this PR: #13672

@michaelklishin
Collaborator

Thank you, @Gsantomaggio. That's indeed relevant, just hasn't been finished or shipped yet :)
