Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(source): create source stream with retry during recovery #19805

Merged
merged 14 commits into from
Dec 18, 2024

Conversation

StrikeW
Copy link
Contributor

@StrikeW StrikeW commented Dec 16, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previously we reverted the retry for source executor #19754. This PR we re-introduce the retry mechanism but only during recovery process. Because when source job create in the first time, it should fail if cannot reach the upstream system.

And we merge the patch #19714 into this PR

passed main-cron https://buildkite.com/risingwavelabs/main-cron/builds/4092
close #19681

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@StrikeW StrikeW requested review from hzxa21 and xxchan December 16, 2024 04:08
@xxchan xxchan requested a review from BugenZhao December 16, 2024 04:46
Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may remove the workaround here to get it tested.

# XXX: wait until source reader is ready. then produce data.
# This is a temporary workaround for a data loss bug https://github.com/risingwavelabs/risingwave/issues/19681#issuecomment-2532183002
sleep 2s

BTW, is there any test case covered the CDC sync issue #19681?

Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

@StrikeW
Copy link
Contributor Author

StrikeW commented Dec 16, 2024

We may remove the workaround here to get it tested.

# XXX: wait until source reader is ready. then produce data.
# This is a temporary workaround for a data loss bug https://github.com/risingwavelabs/risingwave/issues/19681#issuecomment-2532183002
sleep 2s

BTW, is there any test case covered the CDC sync issue #19681?

I am testing #19681 with the test pipeline https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source

@StrikeW StrikeW requested review from xxchan and hzxa21 December 17, 2024 02:28
@StrikeW
Copy link
Contributor Author

StrikeW commented Dec 17, 2024

We may remove the workaround here to get it tested.

# XXX: wait until source reader is ready. then produce data.
# This is a temporary workaround for a data loss bug https://github.com/risingwavelabs/risingwave/issues/19681#issuecomment-2532183002
sleep 2s

BTW, is there any test case covered the CDC sync issue #19681?

I am testing #19681 with the test pipeline https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source

Test passed https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/133

LOG.info(
"creating cdc source job: publication '{}' doesn't exist, creating...",
pubName);
DbzSourceUtils.createPostgresPublicationInValidate(userProps);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to create the publication in meta instead of doing it in the source reader? I thought the CREATE SOURCE logic is unchanged (block the 1st barrier until source reader creation succeeds) before and after this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see in the source_executor.rs, we forward the initial barrier first before we create the source reader. IIRC, after Meta collected the initial barrier the CREATE SOURCE will return success, and then following CREATE TABLE command can be issued to Meta. So during validation of CREATE TABLE, the pg publication may not have been created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this race exist for a long time ago, not the new problem introduced by this PR.

@hzxa21
Copy link
Collaborator

hzxa21 commented Dec 17, 2024

We may remove the workaround here to get it tested.

# XXX: wait until source reader is ready. then produce data.
# This is a temporary workaround for a data loss bug https://github.com/risingwavelabs/risingwave/issues/19681#issuecomment-2532183002
sleep 2s

BTW, is there any test case covered the CDC sync issue #19681?

I am testing #19681 with the test pipeline https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source

Test passed https://buildkite.com/risingwave-test/ch-benchmark-pg-cdc-shared-source/builds/133

Will the ch-benchmark test the case when upstream is unavailable? If not, I think we still in lack of a test case for that.

@StrikeW StrikeW requested a review from hzxa21 December 17, 2024 05:39
@hzxa21
Copy link
Collaborator

hzxa21 commented Dec 17, 2024

Link to the context: #19681 (comment)

Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@StrikeW StrikeW enabled auto-merge December 18, 2024 08:54
@StrikeW StrikeW added this pull request to the merge queue Dec 18, 2024
@graphite-app graphite-app bot requested a review from a team December 18, 2024 09:02
Merged via the queue into main with commit fe84f69 Dec 18, 2024
28 of 29 checks passed
@StrikeW StrikeW deleted the siyuan/source-retry branch December 18, 2024 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some tables doesn't sync for PG CDC v2.1.0-rt.8
4 participants