How to handle interop when binlog pointer changes, or when migrating between binlog and column-based incremental sync #1015
-
Here is one proposal I was considering when working on the LOG_BASED spec. Proposal: track bookmarks relevant to
-
@aaronsteers re: challenge 2 - the way I've handled this in the past, at least with postgres, is to create the replication slot (which starts collecting binlogs for a particular consumer in postgres) before doing a FULL_TABLE sync. This ensures the binlog captures any changes whilst the full table sync happens, and once the second job runs in INCREMENTAL, it just picks up all available binlogs (leaving a bookmark for incremental thereafter).
Maybe postgres is a special case, though. Replication slots are designed to be used as a 1:1 mapping between log stream and consumer, and logs will be kept until the consumer fetches them. So the challenge you are describing doesn't really apply to postgres, though users who do a backfill before creating a replication slot risk losing data (a similar outcome) 🤔
In the MySQL case, you'd basically want to set your binlog retention to be longer than the time it takes to do a full refresh, to ensure that replaying the entire available binlog onto your FULL_TABLE sync captures all data.
I actually think pipelinewise supports the case where a new stream is added configured with BINLOG as its replication mode. Under the hood, PPW will i) check the replication slot exists, ii) check the destination table exists (and find that it doesn't), iii) fall back on a full-table sync, and iv) then finally run a first binlog sync to 'catch up' the full-table sync. It's pretty neat. I believe the code path is similar for an explicit full-refresh, whereby you want to force a full-table sync before returning to binlog replication. I'd have to look again at
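The fallback flow described above could be sketched roughly as follows. This is not PipelineWise's actual code, just an illustration of the decision order; the function and action names are made up for this example.

```python
def choose_sync_actions(slot_exists: bool, table_exists: bool) -> list:
    """Return the ordered sync actions for a stream configured as LOG_BASED.

    Illustrative sketch only: models the i)-iv) decision flow described
    above, not any real tap's API.
    """
    actions = []
    if not slot_exists:
        # Create the slot *before* any backfill, so changes made while the
        # full-table sync runs are retained in the log and replayed after.
        actions.append("create_replication_slot")
    if not table_exists:
        # New stream: backfill first, then let the log sync catch it up.
        actions.append("full_table_sync")
    # Finally, consume the available log entries and leave a bookmark.
    actions.append("log_based_sync")
    return actions


# Brand-new stream: slot and destination table both missing.
print(choose_sync_actions(slot_exists=False, table_exists=False))
# Steady state: slot and table already exist, just consume the log.
print(choose_sync_actions(slot_exists=True, table_exists=True))
```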
-
This came up in slack: https://meltano.slack.com/archives/C01TCRBBJD7/p1664439495487019
It is also related to the LOG_BASED SDK discussion:
Challenge statement 1
The source data for MySQL (for instance) is migrating from one backend RDBMS instance to a new one. In the process of migrating the data, new binlogs are created which don't match the old identifiers.
As of now, the user would have to manually edit the STATE pointers to reflect the new binlog position pointers, or else start over with a full backfill.
Challenge statement 2
In an initial backfill, FULL_TABLE or INCREMENTAL is used to quickly export all historical data. However, if there was no binlog position recorded, we don't have an easy way to transition from column-based incremental to log-based incremental.
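One way to make that transition possible is to record the binlog coordinates immediately before the initial backfill, and persist them as the stream's bookmark so a later log-based run can start from that point (at worst replaying changes that the backfill already captured, rather than missing any). The sketch below illustrates the idea for MySQL; `run_query`, `full_table_sync`, and `write_bookmark` are hypothetical callables, not a real SDK API.

```python
def capture_binlog_position(run_query):
    """Read the current MySQL binlog coordinates.

    `SHOW MASTER STATUS` returns a row whose first two columns are the
    current log file name and position.
    """
    row = run_query("SHOW MASTER STATUS")
    return {"log_file": row[0], "log_pos": row[1]}


def initial_sync(run_query, full_table_sync, write_bookmark):
    # 1. Capture the log position *before* backfilling, so any changes made
    #    during the backfill are replayed later instead of being lost.
    bookmark = capture_binlog_position(run_query)
    # 2. Export all historical data (FULL_TABLE or column-based INCREMENTAL).
    full_table_sync()
    # 3. Persist the pre-backfill position; the next run can start
    #    log-based replication from here.
    write_bookmark(bookmark)
```

The key design choice is ordering: capturing the position first trades a small amount of duplicate replay for a guarantee that no change falls into the gap between the backfill and the first log-based run.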