[fix][test] Fix ManagedCursorTest.testForceCursorRecovery #23518

ZhaoGuorui666 · 2024-10-27T13:54:14Z

Motivation

Ideally, recoverFromLedger should call operationComplete only on successful recovery and operationFailed only on failure. In the actual implementation, however, both methods may be called, for the following reason:

Concurrent Execution: If recoverFromLedger is executed in different threads, it may meet both the success and the failure conditions at the same time, so both callbacks are triggered.

After adding logs, we may see both:

Cursor recovery success
Cursor recovery failed

Modifications

Use an AtomicBoolean in the callback to record whether the callback method has already been called, preventing duplicate calls.
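
A minimal sketch of the guard described above, assuming a simplified callback interface. `RecoveryCallback`, `OnceOnlyCallback` and `OnceOnlyCallbackDemo` are hypothetical names used for illustration, not Pulsar's actual types; only `AtomicBoolean.compareAndSet` is real JDK API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a recovery callback; not Pulsar's real interface.
interface RecoveryCallback {
    void operationComplete();
    void operationFailed(Exception e);
}

// Wraps a callback so that only the first completion (success or failure) is forwarded.
final class OnceOnlyCallback implements RecoveryCallback {
    private final RecoveryCallback delegate;
    private final AtomicBoolean invoked = new AtomicBoolean(false);

    OnceOnlyCallback(RecoveryCallback delegate) {
        this.delegate = delegate;
    }

    @Override
    public void operationComplete() {
        // compareAndSet succeeds for exactly one caller, even across threads.
        if (invoked.compareAndSet(false, true)) {
            delegate.operationComplete();
        }
    }

    @Override
    public void operationFailed(Exception e) {
        if (invoked.compareAndSet(false, true)) {
            delegate.operationFailed(e);
        }
    }
}

final class OnceOnlyCallbackDemo {
    // Fires both callbacks on the guarded wrapper and returns how many reached the delegate.
    static int deliveredCount() {
        AtomicInteger delivered = new AtomicInteger();
        RecoveryCallback guarded = new OnceOnlyCallback(new RecoveryCallback() {
            @Override public void operationComplete() { delivered.incrementAndGet(); }
            @Override public void operationFailed(Exception e) { delivered.incrementAndGet(); }
        });
        guarded.operationFailed(new IllegalStateException("simulated failure"));
        guarded.operationComplete(); // dropped: a completion was already delivered
        return delivered.get();
    }

    public static void main(String[] args) {
        System.out.println("callbacks delivered: " + deliveredCount());
    }
}
```

A guard like this suppresses a duplicate invocation but does not remove whatever caused it, so it is worth checking where the second call comes from.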

Documentation

doc
doc-required
doc-not-needed
doc-complete

github-actions · 2024-10-27T13:54:43Z

@ZhaoGuorui666 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

nodece · 2024-10-28T07:46:12Z

Could you verify https://github.com/apache/pulsar/pull/23518/files#diff-42dd67f8328871b3f0d30446ed1a4f1856807860c5226a9ec6c4d3a9f8813428R4882? A return should be appended here.

ZhaoGuorui666 · 2024-10-29T03:04:56Z

As I understand it, the ManagedCursorImpl.recoverFromLedger() that this test ends up invoking calls BookKeeper.syncOpenLedger() rather than the method in TestPulsarMockBookKeeper.

ZhaoGuorui666 · 2024-10-29T03:07:31Z

Executed the test 3000 times; all runs passed.

codecov-commenter · 2024-10-30T03:15:24Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 74.31%. Comparing base (bbc6224) to head (f3c433e).
Report is 701 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #23518      +/-   ##
============================================
+ Coverage     73.57%   74.31%   +0.74%     
- Complexity    32624    34906    +2282     
============================================
  Files          1877     1943      +66     
  Lines        139502   147001    +7499     
  Branches      15299    16196     +897     
============================================
+ Hits         102638   109247    +6609     
- Misses        28908    29306     +398     
- Partials       7956     8448     +492

Flag	Coverage Δ
inttests	`27.34% <ø> (+2.76%)`	⬆️
systests	`24.39% <ø> (+0.07%)`	⬆️
unittests	`73.67% <ø> (+0.82%)`	⬆️

Flags with carried forward coverage won't be shown.

see 650 files with indirect coverage changes

nodece · 2024-10-30T03:22:21Z

In this test, we are using org.apache.bookkeeper.mledger.impl.ManagedCursorTest.TestPulsarMockBookKeeper.

ManagedCursorImpl.recoverFromLedger() will call this method of that class:

public void asyncOpenLedger(final long lId, final DigestType digestType, final byte[] passwd,
        final OpenCallback cb, final Object ctx) {
    if (ledgerErrors.containsKey(lId)) {
        cb.openComplete(ledgerErrors.get(lId), null, ctx);
        return; // TODO: Please add a return.
    }
    super.asyncOpenLedger(lId, digestType, passwd, cb, ctx);
}

It seems that the test results were affected by multiple callback calls.
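
To illustrate why the missing `return` matters, here is a minimal sketch built from hypothetical stand-ins (`OpenCb`, `BaseOpener`, `FaultyMock`, `FixedMock`); none of them are the real BookKeeper or Pulsar types, and the return codes are only illustrative. Without the early return, the error path falls through to the superclass call and the callback fires twice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical callback: receives only a return code.
interface OpenCb {
    void openComplete(int rc);
}

// Hypothetical base class standing in for the real client; always reports code -7.
class BaseOpener {
    void asyncOpenLedger(long id, OpenCb cb) {
        cb.openComplete(-7);
    }
}

// Error injection WITHOUT a return: the callback is invoked, then super is called too.
class FaultyMock extends BaseOpener {
    final Map<Long, Integer> ledgerErrors = new HashMap<>();

    @Override
    void asyncOpenLedger(long id, OpenCb cb) {
        if (ledgerErrors.containsKey(id)) {
            cb.openComplete(ledgerErrors.get(id));
            // no return: execution falls through to the superclass call
        }
        super.asyncOpenLedger(id, cb);
    }
}

// Error injection WITH a return: the callback is invoked exactly once.
class FixedMock extends BaseOpener {
    final Map<Long, Integer> ledgerErrors = new HashMap<>();

    @Override
    void asyncOpenLedger(long id, OpenCb cb) {
        if (ledgerErrors.containsKey(id)) {
            cb.openComplete(ledgerErrors.get(id));
            return;
        }
        super.asyncOpenLedger(id, cb);
    }
}

final class MockReturnDemo {
    // Return codes seen by the callback when the mock lacks the return.
    static List<Integer> faultyCalls() {
        FaultyMock mock = new FaultyMock();
        mock.ledgerErrors.put(-1L, -8); // arbitrary injected error code
        List<Integer> seen = new ArrayList<>();
        mock.asyncOpenLedger(-1L, rc -> seen.add(rc));
        return seen;
    }

    // Return codes seen by the callback when the mock returns early.
    static List<Integer> fixedCalls() {
        FixedMock mock = new FixedMock();
        mock.ledgerErrors.put(-1L, -8);
        List<Integer> seen = new ArrayList<>();
        mock.asyncOpenLedger(-1L, rc -> seen.add(rc));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("without return: " + faultyCalls());
        System.out.println("with return:    " + fixedCalls());
    }
}
```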

nodece

Left a comment.

ZhaoGuorui666 · 2024-10-30T03:55:02Z

@nodece Thank you for the prompt suggestion. It has been revised.

summeriiii · 2024-10-30T04:52:17Z

2024-10-30T12:38:15,591 - ERROR - [main:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: Bookie handle is not available
2024-10-30T12:38:15,592 - INFO  - [main:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,592 - INFO  - [main:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@550] - [my_test_ledger] Opened ledger -1 for cursor c1. rc=-7
2024-10-30T12:38:15,593 - ERROR - [test-OrderedScheduler-0-0:ManagedCursorImpl@553] - [my_test_ledger] Error opening metadata ledger -1 for cursor c1: No such ledger exists on Bookies
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:ManagedCursorImpl@750] - [my_test_ledger] Cursor c1 recovered to position 3:-1
2024-10-30T12:38:15,593 - INFO  - [test-OrderedScheduler-0-0:MetaStoreImpl@257] - expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1
markDeleteLedgerId: 3
markDeleteEntryId: -1
lastActive: 1730263095578

2024-10-30T12:38:15,601 - INFO  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2772] - ledger.getStore().asyncUpdateCursorInfo operation complete
2024-10-30T12:38:15,602 - WARN  - [bookkeeper-ml-scheduler-OrderedScheduler-1-0:ManagedCursorImpl$25@2779] - [my_test_ledger] Failed to update cursor metadata for c1 due to version conflict org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException:

java.lang.AssertionError:
Expected :true
Actual   :false
<Click to see difference>


	at org.testng.Assert.fail(Assert.java:110)
	at org.testng.Assert.failNotEquals(Assert.java:1577)
	at org.testng.Assert.assertTrue(Assert.java:56)
	at org.testng.Assert.assertTrue(Assert.java:66)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorTest.testForceCursorRecovery(ManagedCursorTest.java:4858)

From the log of the failed test, we can see expectedVersion:0 [my_test_ledger] Creating consumer c1 on meta-data store with cursorsLedgerId: -1 twice.
The root cause of this flaky test is that cb.openComplete() is called twice, which can cause a race condition: before the first cursorInfo has been saved, the second cursorInfo update starts with the old version, so saving the second cursorInfo metadata throws a BadVersion exception.

So we don't need to add AtomicBoolean callbackInvoked = new AtomicBoolean(false); we can fix this simply by adding a return in TestPulsarMockBookKeeper.asyncOpenLedger, or by adding an else, like this:

public void asyncOpenLedger(final long lId, final DigestType digestType, final byte[] passwd,
        final OpenCallback cb, final Object ctx) {
    if (ledgerErrors.containsKey(lId)) {
        cb.openComplete(ledgerErrors.get(lId), null, ctx);
    } else {
        super.asyncOpenLedger(lId, digestType, passwd, cb, ctx);
    }
}
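
The version conflict described above can be reproduced in miniature. The sketch below uses a hypothetical `VersionedStore` with compare-and-set semantics, not Pulsar's actual metadata store API: two writers both observe version 0, the first write succeeds and bumps the version, and the second is rejected because its expected version is stale.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified versioned store; not Pulsar's metadata store API.
final class VersionedStore {
    // Immutable snapshot of the stored value plus its version.
    static final class Entry {
        final long version;
        final String value;

        Entry(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final AtomicReference<Entry> current =
            new AtomicReference<>(new Entry(0, "initial"));

    long version() {
        return current.get().version;
    }

    // Writes only if the caller's expected version matches the stored one.
    boolean put(String value, long expectedVersion) {
        Entry seen = current.get();
        if (seen.version != expectedVersion) {
            return false; // stale version: the analogue of a BadVersion failure
        }
        return current.compareAndSet(seen, new Entry(seen.version + 1, value));
    }
}

final class BadVersionDemo {
    // Simulates two callback invocations that both captured the same version.
    static boolean[] twoWritesWithSameVersion() {
        VersionedStore store = new VersionedStore();
        long v = store.version();                         // both writers observe version 0
        boolean first = store.put("cursor info A", v);    // accepted, store moves to version 1
        boolean second = store.put("cursor info B", v);   // still expects 0, so it is rejected
        return new boolean[] {first, second};
    }

    public static void main(String[] args) {
        boolean[] r = twoWritesWithSameVersion();
        System.out.println("first write accepted:  " + r[0]);
        System.out.println("second write accepted: " + r[1]);
    }
}
```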

ZhaoGuorui666 · 2024-10-30T05:26:44Z

Yes, using an if/else or adding a return is the way to truly solve the problem. Here are the test results:

nodece

LGTM

managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java

lhotari

LGTM, good work @ZhaoGuorui666. Thanks for putting effort in fixing flaky tests.

(cherry picked from commit d2d05c2)

ZhaoGuorui666 added 2 commits October 27, 2024 13:35

issue apache#23417 Flaky-test

b4371f0

issue apache#23417 Flaky-test

f3c433e

github-actions bot added the doc-label-missing label Oct 27, 2024

github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Oct 27, 2024

Technoboy- approved these changes Oct 30, 2024

Technoboy- assigned ZhaoGuorui666 Oct 30, 2024

Technoboy- added this to the 4.1.0 milestone Oct 30, 2024

Technoboy- added the ready-to-test label Oct 30, 2024

nodece requested changes Oct 30, 2024

issue apache#23417 Flaky-test - add return

1adf50b

issue apache#23417 Flaky-test

d2ff10e

nodece approved these changes Oct 30, 2024

lhotari reviewed Oct 30, 2024

managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java (outdated, resolved)

nodece reviewed Oct 31, 2024

managed-ledger/src/test/java/org/apache/bookkeeper/mledger/impl/ManagedCursorTest.java (outdated, resolved)

ZhaoGuorui666 added 2 commits November 3, 2024 19:13

Merge branch 'apache:master' into testForceCursorRecovery

8425f47

solve issue#23417

3dd817f

ZhaoGuorui666 requested a review from lhotari November 4, 2024 08:53

lhotari approved these changes Nov 5, 2024

lhotari added the release/4.0.1 label Nov 5, 2024

nodece merged commit d2d05c2 into apache:master Nov 5, 2024
52 checks passed

visxu pushed a commit to vissxu/pulsar that referenced this pull request Nov 6, 2024

[fix][test] Fix ManagedCursorTest.testForceCursorRecovery (apache#23518)

b3b542c

lhotari pushed a commit that referenced this pull request Nov 13, 2024

[fix][test] Fix ManagedCursorTest.testForceCursorRecovery (#23518)

768132e

(cherry picked from commit d2d05c2)

lhotari added the cherry-picked/branch-4.0 label Nov 13, 2024
