
MySQL Snapshot fails at 50m rows and keeps starting over and over again #812

Open
poweroftrue opened this issue Sep 14, 2024 · 5 comments

@poweroftrue

Hello there,

The lightweight version is true magic: one container and you have near-real-time replication from MySQL to ClickHouse with all the columns mapped correctly for you, a piece of art!

For a while now I have been debugging an issue where java.lang.InterruptedException: null is thrown at around 50 million rows; the snapshot crashes and then starts over again and again.

java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1638) ~[?:?]
	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435) ~[?:?]
	at java.util.concurrent.ExecutorCompletionService.take(ExecutorCompletionService.java:200) ~[?:?]
	at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:515) ~[app.jar:?]
	at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:171) ~[app.jar:?]
	at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:96) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:285) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:269) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:192) ~[app.jar:?]
	at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:143) ~[app.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]

What I have tried so far:

  1. Increased all the timeouts in MySQL to the max and checked them with the query below (a sketch of raising them is shown after this list):
SELECT 
    variable_name, variable_value
FROM
    performance_schema.global_variables
WHERE
    variable_name IN ('interactive_timeout' , 'wait_timeout',
        'connect_timeout',
        'net_read_timeout',
        'net_write_timeout',
        'lock_wait_timeout',
        'innodb_lock_wait_timeout',
        'max_execution_time')
ORDER BY variable_name;

[screenshot: the resulting timeout values]

  2. Increased memory and watched heap usage with jcmd 1 GC.heap_info inside the container:
[screenshot: jcmd heap info]

So it's not a memory usage problem.

  3. Used different versions (branches), compiled them, and deployed them to k8s, but hit the same error.
  4. Tried other big tables; they fail the same way but break at a different row count (the other tables were smaller).
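
For reference, a minimal sketch of how these timeouts could be raised, assuming MySQL 8.0+ where SET PERSIST is available; the values are illustrative, not recommendations:

-- Raise the server-side timeouts that can cut a long-running snapshot connection.
SET PERSIST connect_timeout          = 300;        -- seconds
SET PERSIST wait_timeout             = 31536000;   -- seconds (documented maximum)
SET PERSIST interactive_timeout      = 31536000;   -- seconds
SET PERSIST net_read_timeout         = 3600;       -- seconds
SET PERSIST net_write_timeout        = 3600;       -- seconds
SET PERSIST lock_wait_timeout        = 31536000;   -- seconds
SET PERSIST innodb_lock_wait_timeout = 1073741824; -- seconds
SET PERSIST max_execution_time       = 0;          -- 0 disables the statement execution time limit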

I'm using no locks: snapshot.locking.mode: "none".
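
For context, a rough sketch (not the connector's exact sequence) of the global read lock the Debezium MySQL snapshot would otherwise take while reading the binlog position and schema, which "none" skips:

-- Locking phase that snapshot.locking.mode: "none" avoids.
FLUSH TABLES WITH READ LOCK;   -- global read lock
SHOW MASTER STATUS;            -- record binlog file and position
-- ... capture table schemas ...
UNLOCK TABLES;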

Related issue #722

What I'm trying now is attaching a local IntelliJ remote debugger, but I'm having a hard time stopping on the right exception.

@aadant
Collaborator

aadant commented Sep 14, 2024

@poweroftrue for MySQL, you can take a snapshot via MySQL Shell (see the Python utilities in this project).
It takes a consistent snapshot; you can then restore it to ClickHouse and start replicating from the snapshot position, or even from before it.

You can dump TBs of data this way.

@poweroftrue
Author

Thanks @aadant!

Yes, I checked the tool; it's amazing, it dumps TBs of data very fast. I will use it after spending a couple more weeks on this issue, because this approach fits my use case better.

Having the Debezium snapshot working opens the door to simpler automation: running this as a sidecar container for a large number of MySQL services.

@poweroftrue
Author

Hello @aadant, I finally found what causes this issue!

I was debugging whether certain rows cause the interruption, but that wasn't it. I enabled all DEBUG logging in log4j.xml and found that Debezium was being asked to stop.

It turned out to be restart.event.loop kicking in and restarting Debezium because no events were arriving (which is normal during this kind of snapshot):

2024-09-12 03:03:11.146 INFO  - Stopping the embedded engine
2024-09-12 03:03:11.146 INFO  - Waiting for PT5M for connector to stop
2024-09-12 03:03:11.162 INFO  - Stopping the task and engine
2024-09-12 03:03:11.162 INFO  - Stopping down connector
2024-09-12 03:04:41.162 WARN  - Coordinator didn't stop in the expected time, shutting down executor now

I set it to false and now it's snapshotting rows beyond 100 million.

Should I add to the docs that this needs to be disabled during the snapshot?

@aadant
Collaborator

aadant commented Sep 15, 2024

Good catch! I think it is a bug, and we should fix it because it is very misleading. The connector should only restart if it is idle (the restart loop is a workaround for a Debezium bug that should be fixed in the latest releases).

@aadant
Collaborator

aadant commented Sep 15, 2024

See #380.
It is recommended to replicate with a heartbeat table so you can tell whether the connector is lagging.
If there are no updates, the connector will not update the replica_source_info table.
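
For illustration, a minimal sketch of such a heartbeat table on the MySQL source; the table and column names are hypothetical, and the periodic update can come from cron or, assuming the Debezium properties are passed through, from heartbeat.action.query:

-- Hypothetical heartbeat table on the source database.
CREATE TABLE IF NOT EXISTS heartbeat (
    id INT PRIMARY KEY,
    ts TIMESTAMP(3) NOT NULL DEFAULT CURRENT_TIMESTAMP(3)
);
INSERT INTO heartbeat (id) VALUES (1)
    ON DUPLICATE KEY UPDATE ts = CURRENT_TIMESTAMP(3);

-- Touch the row periodically so binlog events keep flowing;
-- lag can then be measured by comparing ts on the replica.
UPDATE heartbeat SET ts = CURRENT_TIMESTAMP(3) WHERE id = 1;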

@subkanthi let us fix this even if there is a workaround.
