
MySQL Snapshot fails at 50m rows and keeps starting over and over again #812

Open
poweroftrue opened this issue Sep 14, 2024 · 5 comments

@poweroftrue

Hello there,

The lightweight version is true magic: one container and you have near-real-time replication from MySQL to ClickHouse with all the columns mapped correctly for you, a piece of art!

For a while now I have been debugging an issue where java.lang.InterruptedException: null is thrown at around 50 million rows; the snapshot crashes and then starts over again and again.

java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1638) ~[?:?]
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435) ~[?:?]
	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:200) ~[?:?]
	at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:515) ~[app.jar:?]
	at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:171) ~[app.jar:?]
	at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:96) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:285) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:269) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:192) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:143) ~[app.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]

What I have tried so far:

  1. Increased all the timeouts in MySQL to the max and checked them with the query below (a sketch of raising them is shown after this list):
SELECT 
    variable_name, variable_value
FROM
    performance_schema.global_variables
WHERE
    variable_name IN ('interactive_timeout' , 'wait_timeout',
        'connect_timeout',
        'net_read_timeout',
        'net_write_timeout',
        'lock_wait_timeout',
        'innodb_lock_wait_timeout',
        'max_execution_time')
ORDER BY variable_name;

[screenshot: the resulting timeout values]

  2. Increased memory and watched heap usage with jcmd 1 GC.heap_info inside the container:
[screenshot: jcmd heap info]

So it's not a memory usage problem.

  3. Used different versions (branches), compiled them, and deployed them to k8s, but hit the same error.
  4. Tried other big tables; they fail the same way but break at a different row count (the other tables were smaller).
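
For reference, a minimal sketch of how these timeouts could be raised, assuming MySQL 8.0+ where SET PERSIST is available; the values are illustrative, not recommendations:

-- Raise the server-side timeouts that can cut a long-running snapshot connection.
SET PERSIST connect_timeout          = 300;        -- seconds
SET PERSIST wait_timeout             = 31536000;   -- seconds (documented maximum)
SET PERSIST interactive_timeout      = 31536000;   -- seconds
SET PERSIST net_read_timeout         = 3600;       -- seconds
SET PERSIST net_write_timeout        = 3600;       -- seconds
SET PERSIST lock_wait_timeout        = 31536000;   -- seconds
SET PERSIST innodb_lock_wait_timeout = 1073741824; -- seconds
SET PERSIST max_execution_time       = 0;          -- 0 disables the statement execution time limit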

I'm using no locks: snapshot.locking.mode: "none".
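
For context, a rough sketch (not the connector's exact sequence) of the global read lock the Debezium MySQL snapshot would otherwise take while reading the binlog position and schema, which "none" skips:

-- Locking phase that snapshot.locking.mode: "none" avoids.
FLUSH TABLES WITH READ LOCK;   -- global read lock
SHOW MASTER STATUS;            -- record binlog file and position
-- ... capture table schemas ...
UNLOCK TABLES;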

Related issue #722

What I'm trying now is attaching a local IntelliJ remote debugger, but I'm having a hard time stopping on the right exception.

@aadant
Collaborator

aadant commented Sep 14, 2024

@poweroftrue for MySQL, you can take a snapshot via MySQL Shell (see the Python utilities in this project).
It takes a consistent snapshot; you can then restore it to ClickHouse and start replicating from the snapshot position, or even from before it.

You can dump TBs of data this way.

@poweroftrue
Author

Thanks @aadant!

Yes, I checked the tool; it's amazing, it dumps TBs of data very fast. I will use it after spending a couple more weeks on this issue, because this approach fits my use case better.

Having the Debezium snapshot working opens the door to simpler automation: running this as a sidecar container for a large number of MySQL services.

@poweroftrue
Author

Hello @aadant, I finally found what causes this issue!

I was debugging whether certain rows cause the interruption, but that wasn't it. I enabled all DEBUG logging in log4j.xml and found that Debezium was being asked to stop.

It turned out to be restart.event.loop kicking in and restarting Debezium because no events were arriving (which is normal during this kind of snapshot):

2024-09-12 03:03:11.146 INFO  - Stopping the embedded engine
2024-09-12 03:03:11.146 INFO  - Waiting for PT5M for connector to stop
2024-09-12 03:03:11.162 INFO  - Stopping the task and engine
2024-09-12 03:03:11.162 INFO  - Stopping down connector
2024-09-12 03:04:41.162 WARN  - Coordinator didn't stop in the expected time, shutting down executor now

I set it to false and now it's snapshotting rows beyond 100 million.

Should I add to the docs that this needs to be disabled during the snapshot?

@aadant
Collaborator

aadant commented Sep 15, 2024

Good catch! I think it is a bug, and we should fix it because it is very misleading. The connector should only restart if it is idle (the restart loop is a workaround for a Debezium bug that should be fixed in the latest releases).

@aadant
Collaborator

aadant commented Sep 15, 2024

See #380.
It is recommended to replicate with a heartbeat table so you can tell whether the connector is lagging.
If there are no updates, the connector will not update the replica_source_info table.
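
For illustration, a minimal sketch of such a heartbeat table on the MySQL source; the table and column names are hypothetical, and the periodic update can come from cron or, assuming the Debezium properties are passed through, from heartbeat.action.query:

-- Hypothetical heartbeat table on the source database.
CREATE TABLE IF NOT EXISTS heartbeat (
    id INT PRIMARY KEY,
    ts TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3)
);
INSERT INTO heartbeat (id) VALUES (1)
    ON DUPLICATE KEY UPDATE ts = CURRENT_TIMESTAMP(3);

-- Touch the row periodically so binlog events keep flowing;
-- lag can then be measured by comparing ts on the replica.
UPDATE heartbeat SET ts = CURRENT_TIMESTAMP(3) WHERE id = 1;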

@subkanthi let us fix this even if there is a workaround.
