How to handle interop when binlog pointer changes, or when migrating between binlog and column-based incremental sync #1015
-
Here is one proposal I was considering when working on the LOG_BASED spec. Proposal: track bookmarks relevant to
-
@aaronsteers re: challenge 2 - the way I've handled this in the past, at least with postgres, is to create the replication slot (which starts collecting binlogs for a particular consumer in postgres) before doing a FULL_TABLE sync. This ensures the binlog captures any changes whilst the full table sync happens, and once the second job runs in INCREMENTAL, it just picks up all available binlogs (leaving a bookmark for incremental thereafter).
Maybe postgres is a special case, though. Replication slots are designed to be used as a 1:1 mapping between log stream and consumer, and logs will be kept until the consumer fetches them. So the challenge you are describing doesn't really apply to postgres, though users who do a backfill before creating a replication slot risk losing data (a similar outcome) 🤔
In the MySQL case, you'd basically want to set your binlog retention to be longer than the time it takes to do a full refresh, to ensure that replaying the entire available binlog onto your FULL_TABLE sync captures all data.
I actually think pipelinewise supports the case where a new stream is added configured with BINLOG as its replication mode. Under the hood, PPW will i) check the replication slot exists, ii) check the destination table exists (and find that it doesn't), iii) fall back on a full-table sync, and iv) then finally run a first binlog sync to 'catch up' the full-table sync. It's pretty neat. I believe the code path is similar for an explicit full-refresh, whereby you want to force a full-table sync before returning to binlog replication. I'd have to look again at
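The fallback flow described above could be sketched roughly as follows. This is not PipelineWise's actual code, just an illustration of the decision order; the function and action names are made up for this example.

```python
def choose_sync_actions(slot_exists: bool, table_exists: bool) -> list:
    """Return the ordered sync actions for a stream configured as LOG_BASED.

    Illustrative sketch only: models the i)-iv) decision flow described
    above, not any real tap's API.
    """
    actions = []
    if not slot_exists:
        # Create the slot *before* any backfill, so changes made while the
        # full-table sync runs are retained in the log and replayed after.
        actions.append("create_replication_slot")
    if not table_exists:
        # New stream: backfill first, then let the log sync catch it up.
        actions.append("full_table_sync")
    # Finally, consume the available log entries and leave a bookmark.
    actions.append("log_based_sync")
    return actions


# Brand-new stream: slot and destination table both missing.
print(choose_sync_actions(slot_exists=False, table_exists=False))
# Steady state: slot and table already exist, just consume the log.
print(choose_sync_actions(slot_exists=True, table_exists=True))
```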
-
This came up in slack: https://meltano.slack.com/archives/C01TCRBBJD7/p1664439495487019
It is also related to the LOG_BASED SDK discussion:
Challenge statement 1
The source data for MySQL (for instance) is migrating from one backend RDBMS instance to a new one. In the process of migrating the data, new binlogs are created which don't match the old identifiers.
As of now, the user would have to manually edit the STATE pointers to reflect the new binlog position pointers, or else start over with a full backfill.
Challenge statement 2
In an initial backfill, FULL_TABLE or INCREMENTAL is used to quickly export all historical data. However, if there was no binlog position recorded, we don't have an easy way to transition from column-based incremental to log-based incremental.
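One way to make that transition possible is to record the binlog coordinates immediately before the initial backfill, and persist them as the stream's bookmark so a later log-based run can start from that point (at worst replaying changes that the backfill already captured, rather than missing any). The sketch below illustrates the idea for MySQL; `run_query`, `full_table_sync`, and `write_bookmark` are hypothetical callables, not a real SDK API.

```python
def capture_binlog_position(run_query):
    """Read the current MySQL binlog coordinates.

    `SHOW MASTER STATUS` returns a row whose first two columns are the
    current log file name and position.
    """
    row = run_query("SHOW MASTER STATUS")
    return {"log_file": row[0], "log_pos": row[1]}


def initial_sync(run_query, full_table_sync, write_bookmark):
    # 1. Capture the log position *before* backfilling, so any changes made
    #    during the backfill are replayed later instead of being lost.
    bookmark = capture_binlog_position(run_query)
    # 2. Export all historical data (FULL_TABLE or column-based INCREMENTAL).
    full_table_sync()
    # 3. Persist the pre-backfill position; the next run can start
    #    log-based replication from here.
    write_bookmark(bookmark)
```

The key design choice is ordering: capturing the position first trades a small amount of duplicate replay for a guarantee that no change falls into the gap between the backfill and the first log-based run.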