Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic Transaction bug fix with PRS disruption #16576

Merged

Conversation

GuptaManan100
Copy link
Member

@GuptaManan100 GuptaManan100 commented Aug 10, 2024

Description

This PR adds an e2e test for testing atomic transactions with a PRS disruption. While adding the test, a bug was also noticed. This PR fixes that bug too.

These are the following sequence of operations that led to the bug -

  1. Users run an atomic transaction and tries to commit it.
  2. As part of the commit, vtgate tries to call CommitPrepared on all the tablets.
  3. An arbitrary reason can cause the CommitPrepared on a vttablet to stall (In the test we explicitly sleep for 5 seconds)
  4. While the CommitPrepared call is stalled, a PRS is issued in that shard that changes the primary tablet.
  5. During DemotePrimary, we rollback all the prepared transactions by calling FetchAll on the prepared pool.
  6. CommitPrepared finally runs, but it doesn't find any connection since FetchAll already removed them all. So it concludes that there is nothing to commit. It returns a success to the vtgate.
  7. Because vtgate gets a success from all the vttablets, it actually marks the transaction as completed! But in reality we didn't commit anything on the shard we ran PRS on!
  8. This leads to a transaction that is non-atomic since one shard didn't commit its changes.

The fix is pretty straight-forward. We just mark the pool is shutdown after we have called FetchAll. Any calls to FetchForCommit return an error until we transition to AcceptingReadWrite state again.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link

codecov bot commented Aug 10, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 68.76%. Comparing base (bf0c5f8) to head (bcb4247).
Report is 2 commits behind head on main.

Files Patch % Lines
go/vt/vttablet/tabletserver/production.go 0.00% 1 Missing ⚠️
go/vt/vttablet/tabletserver/tabletserver.go 50.00% 1 Missing ⚠️
go/vt/vttablet/tabletserver/tx_prep_pool.go 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16576      +/-   ##
==========================================
- Coverage   68.77%   68.76%   -0.01%     
==========================================
  Files        1556     1557       +1     
  Lines      199812   199826      +14     
==========================================
- Hits       137419   137412       -7     
- Misses      62393    62414      +21     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@GuptaManan100 GuptaManan100 merged commit 3c2e8f9 into vitessio:main Aug 12, 2024
129 checks passed
@GuptaManan100 GuptaManan100 deleted the atomic-transactions-prs-disruption branch August 12, 2024 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants