[Bug] Consumers stop receiving new messages due to invalid blockedConsumerOnUnackedMsgs state #22657
Comments
Thanks for the great issue report, @180254. @poorbarcode or @Technoboy-, do you have a chance to take a look?
@180254 in your case, can you detect the issue from topic stats? For example, does it tell …
Sorry, noticed this only now. I guess topic stats wouldn't have …
Here are the statistics for a "broken topic", collected after a test.
There are 4 consumers, each of them is …
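For anyone who wants to run the same check on their own topics, here is a minimal sketch (not from the original report) that reads the topic stats through the Java admin client and prints the blocked/unacked fields. The admin URL and topic name are placeholders, and the getter names are assumed to match the public stats interfaces:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class BlockedConsumerCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder admin URL and topic name; adjust to your environment.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            TopicStats stats = admin.topics().getStats("persistent://public/default/my-topic");
            stats.getSubscriptions().forEach((subName, sub) -> {
                // Subscription-level view of the blocked state and unacked counter.
                System.out.printf("subscription=%s blockedSubscriptionOnUnackedMsgs=%s unackedMessages=%d%n",
                        subName, sub.isBlockedSubscriptionOnUnackedMsgs(), sub.getUnackedMessages());
                // Per-consumer view: this is the flag the issue is about.
                sub.getConsumers().forEach(c ->
                        System.out.printf("  consumer=%s blockedConsumerOnUnackedMsgs=%s unackedMessages=%d%n",
                                c.getConsumerName(), c.isBlockedConsumerOnUnackedMsgs(), c.getUnackedMessages()));
            });
        }
    }
}
```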
@AdrianPedziwiatr-TomTom thanks for sharing. This is an interesting detail.
A little progress in reproducing the problem in a unit test: 180254@5822ab6 (test22657_1_parameterized). v3.0.1: fails when [ …
I also reproduced the problem for larger values of maxUnackedMsgPerConsumer. Please see the test results:
Some logs from the failed case:
(in summary)
At branch-3.0...180254:pulsar-issue-22657:branch-3.0 you can find all my tests and the restored old version of the individualAckNormal method for testing/comparison.
Getting the same issue, the consumer stops receiving new messages. Pulsar version: 3.2.1.
@prasathsekar You might be facing another bug that is already fixed in 3.2.3 with #22454. Please upgrade to Pulsar 3.2.3 and then comment whether the problem is resolved.
Hi @Technoboy-, @poorbarcode, will you have a chance to look at this? You probably know this area best, because the commit that changed this behavior was yours.
@MichalKoziorowski-TomTom please confirm whether this problem reproduces on 3.2.3 or 3.0.5.
Before submitting a ticket, we also checked 3.2.x; the problem with our service occurred there as well. I reran the proposed BrokerServiceTest.java tests that I shared in previous messages.
I checked the code; my first impression is that there may be race conditions here, but I didn't dive deeper.
@180254 I experimented with some changes in lhotari#192, and I added test cases based on your work. There are multiple inconsistencies in handling the unacked message counts and blocking/unblocking dispatchers. The main gap in the experiment is the handling of negative acknowledgements.
Hi. Is there any chance someone will look at this race condition? We're trying to find a workaround so that we don't hit this problem.
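One possible mitigation, sketched below as an assumption rather than a fix confirmed in this thread: watch the admin stats and unload any topic whose consumers report blockedConsumerOnUnackedMsgs. Unloading closes the topic on the broker so it is re-created with fresh dispatchers, which should be similar in effect to the pod restart mentioned in the report. The helper name and topic handling are hypothetical:

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TopicStats;

public class UnblockWorkaround {
    // Hypothetical helper: unload the topic if any consumer reports
    // blockedConsumerOnUnackedMsgs == true, forcing the broker to
    // rebuild the dispatcher and its unacked-message bookkeeping.
    static void unloadIfBlocked(PulsarAdmin admin, String topic) throws Exception {
        TopicStats stats = admin.topics().getStats(topic);
        boolean blocked = stats.getSubscriptions().values().stream()
                .flatMap(s -> s.getConsumers().stream())
                .anyMatch(c -> c.isBlockedConsumerOnUnackedMsgs());
        if (blocked) {
            admin.topics().unload(topic);
        }
    }
}
```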
I'm doing this now. I'm sorry for the long delay.
I remember seeing that this is a regression caused by #20990.
It looks like #23072 makes improvements to the part where the regression happened.
Rebased the test cases added by @180254 here: lhotari@ff0c8a5. It looks like the problem persists after #21126 and #23072.
Great job by @180254 in doing the troubleshooting, thank you!
This work will need to be resumed.
Version
pulsar server: docker image apachepulsar/pulsar:3.0.4 + helm chart pulsar-helm-chart
pulsar client: java client org.apache.pulsar:pulsar-client:3.0.4
Minimal reproduce step
After updating Apache Pulsar, we noticed that one of the consumers sometimes stops receiving new messages for some topics.
The last fully working version for us is 3.0.1. I have tested all later versions released so far, as well as a build of branch-3.0.
I looked through the commits and determined when our service stops working:
I performed a test using the last commit from branch-3.0 (fd823f6) with the individualAckNormal method reverted to the last version before the breaking commit. The change looks as follows: 180254@6dac4bf. With the modified code, the problem does not occur.
I found nothing in the logs that would indicate the consumer was suspended; there are no unusual logs at all. Restarting the Kubernetes pod with the consumers helps for some time.
What did you expect to see?
The consumer retrieves all messages.
What did you see instead?
Consumers stop receiving new messages for some topics.
Anything else?
The configuration we use:
We can reproduce it on our service. Test scenario: we serviced approximately 20 customers (== 20 topics), each with about 20 messages per second. One message is processed in approximately 200 ms. The problem occurs for some of the topics in the test, not all of them.
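For illustration only, here is a minimal sketch of what one such consumer could look like, assuming a shared subscription and ~200 ms of processing per message. The service URL, topic, and subscription names are placeholders, not the original test code:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SlowConsumer {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        // One of ~20 topics (one per customer), each receiving ~20 msg/s.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/customer-1")
                .subscriptionName("processing")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();
        while (true) {
            Message<byte[]> msg = consumer.receive();
            TimeUnit.MILLISECONDS.sleep(200);   // simulate ~200 ms of processing
            consumer.acknowledge(msg);          // every message is eventually acked
        }
    }
}
```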
When a problem occurs:
Are you willing to submit a PR?