[Bug] Ledger can not recover with Digest Mismatch Error #22906
Comments
This is most likely fixed by #22892 in Pulsar and with Bookkeeper fixes apache/bookkeeper#4426, apache/bookkeeper#4196, apache/bookkeeper#4289 and apache/bookkeeper#4293. The Bookkeeper fixes are included in 4.16.6 release candidate 0 which was recently published and will be part of 4.16.6 release. We will release Pulsar 3.0.6, 3.2.4 and 3.3.1 with the fixes after the Bookkeeper release has completed.
Does this issue also persist after bookie and broker restarts? If so, it matches the symptoms that the above fixes should resolve. Bookkeeper doesn't validate the digest when it stores data. If the data gets corrupted before it is stored (because of bugs), the corrupted data is stored as-is, and it seems that manual intervention (or the dangerous autoSkipNonRecoverableData setting in brokers) is required to skip the corrupted entries.
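To illustrate why corruption that happens before storage survives restarts, here is a minimal toy model in Python. It is not BookKeeper code: `ToyBookie`, `make_entry`, `corrupt` and `verify` are hypothetical names, and `zlib.crc32` only stands in for whatever digest type the ledger is configured with. The only behaviour it assumes is the one described above: the store accepts bytes without checking the digest, and the mismatch shows up on read.

```python
import zlib


def make_entry(payload: bytes) -> dict:
    """Writer side: attach a checksum computed over the original payload."""
    return {"digest": zlib.crc32(payload), "payload": payload}


def corrupt(entry: dict) -> dict:
    """Simulate a bug that flips one bit of the payload before it is stored."""
    damaged = bytearray(entry["payload"])
    damaged[0] ^= 0x01
    return {"digest": entry["digest"], "payload": bytes(damaged)}


class ToyBookie:
    """Hypothetical store that persists entries as received, with no digest check on write."""

    def __init__(self):
        self.entries = {}

    def add_entry(self, entry_id: int, entry: dict) -> None:
        self.entries[entry_id] = entry

    def read_entry(self, entry_id: int) -> dict:
        return self.entries[entry_id]


def verify(entry: dict) -> bool:
    """Reader side: recompute the checksum and compare it with the stored digest."""
    return zlib.crc32(entry["payload"]) == entry["digest"]


bookie = ToyBookie()
bookie.add_entry(0, make_entry(b"good"))
bookie.add_entry(1, corrupt(make_entry(b"bad!")))

# Entry 1 was accepted on write but fails verification on every later read.
readable = {entry_id: verify(bookie.read_entry(entry_id)) for entry_id in (0, 1)}
print(readable)
```

In this model a restart changes nothing, because the damaged bytes are what was persisted; the only options are to skip the entry or repair it by hand.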
There are also some pending Bookkeeper issues and PRs that might be related such as apache/bookkeeper#4171 and apache/bookkeeper#4194. /cc @shoothzj
@TakaHiR07 Do you happen to use TLS between brokers and bookies? One possible workaround until Pulsar 3.0.6/3.2.4/3.3.1 is released with Bookkeeper 4.16.6 is to set
@lhotari The issue persists after restart; the data actually gets corrupted before it is stored. We do not use TLS, but we do use bookkeeperUseV2WireProtocol=true.
Yes, please test with a recent branch-3.0 + BK 4.16.6 rc0. There are also broker cache race condition fixes which now prevent some similar symptoms.
Search before asking
Read release policy
Version
client version : 2.9.5
broker version : 3.0.5
bookie version : 4.16.5
Minimal reproduce step
We are running performance tests against a new Pulsar version.
We use pulsar-perf to produce to a topic with 100 partitions, and simulate frequent broker restarts caused by direct memory OOM.
However, after running for a long time, the topic becomes unavailable, and we find that one partition cannot recover because of a digest mismatch.
We can reproduce this issue; it occurs roughly every three days.
broker OOM :
topic becomes unavailable :
What did you expect to see?
The ledger recovers successfully.
What did you see instead?
From the broker log, we see that ledger 6877081 cannot recover because of a digest mismatch. The ledger was written to 3 bookies without any ensemble change; the E-Qw-Qa configuration is 3-3-2.
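For readers less familiar with the E-Qw-Qa notation, the following Python sketch shows what 3-3-2 implies for a single entry. It assumes a round-robin placement of entries over the ensemble; `write_set` and `is_acknowledged` are hypothetical helper names for illustration, not BookKeeper APIs, and the bookie names are placeholders for the three bookies listed in this report.

```python
def write_set(entry_id: int, ensemble: list, write_quorum: int) -> list:
    """Pick the bookies that receive an entry, assuming round-robin striping."""
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


def is_acknowledged(acks_received: int, ack_quorum: int) -> bool:
    """An add completes once at least ack_quorum bookies have confirmed it."""
    return acks_received >= ack_quorum


# E=3, Qw=3, Qa=2, as in this report.
ensemble = ["bookie1", "bookie2", "bookie3"]
targets = write_set(7, ensemble, write_quorum=3)

# With Qw equal to E, every entry is sent to all three bookies.
print(sorted(targets) == sorted(ensemble))
# Two confirmations are enough for the add to complete; one is not.
print(is_acknowledged(2, ack_quorum=2), is_acknowledged(1, ack_quorum=2))
```

With E equal to Qw, every bookie should hold a copy of every entry, so a digest mismatch on all copies points at corruption before the write rather than at a single bad disk.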
We added extensive logging in both the broker and the bookies to trace the entry. We also added a calculation of the entry's MD5, aiming to find out where the entry was changed. The logs and description are as follows.
broker's log :
ip1:3181 (bookie1)
ip2:3188(bookie2)
ip3:3185(bookie3)
From the above log, we find that
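The MD5 tracing approach described in this section can be sketched roughly as follows in Python. This is only an illustration of the technique, not the logging that was actually added to the broker and bookies; the hop names and payloads are made up, and `first_divergence` is a hypothetical helper.

```python
import hashlib


def md5_hex(payload: bytes) -> str:
    """Fingerprint of the entry payload as observed at one point in the write path."""
    return hashlib.md5(payload).hexdigest()


def first_divergence(hops: list):
    """Return the name of the first hop whose payload MD5 differs from the first hop.

    hops is an ordered list of (hop_name, payload_bytes) tuples.
    Returns None if every hop saw the same bytes.
    """
    if not hops:
        return None
    reference = md5_hex(hops[0][1])
    for name, payload in hops[1:]:
        if md5_hex(payload) != reference:
            return name
    return None


# Hypothetical observations of one entry along the write path.
observed = [
    ("broker: before send", b"entry-payload"),
    ("bookie: on receive", b"entry-payload"),
    ("bookie: before store", b"entry-pAyload"),
]
print(first_divergence(observed))
```

Logging the same fingerprint at every hop narrows the corruption down to the segment between the last matching hop and the first differing one.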
Anything else?
No response
Are you willing to submit a PR?