TiKV initial scan task not cancelled/terminated when changefeed paused or failed #11638

Open
fubinzh opened this issue Oct 9, 2024 · 9 comments

Labels: affects-6.5, affects-7.1, affects-8.5, area/ticdc, severity/major, type/bug

@fubinzh commented Oct 9, 2024

What did you do?

  1. 07:51 - 07:55 Create 4 CDC clusters (each with 32C CDC nodes); for each CDC cluster create one changefeed, each changefeed replicating ~40k tables (~160k tables overall)
  2. 07:58 Pause the changefeeds and update their config
  3. 08:12 Resume all changefeeds
  4. 09:38 Pause all changefeeds
  5. 10:33 Delete the changefeeds (from the TiDB Cloud UI)
  6. 10:07 - 10:33 Manually kill and restart all TiKV nodes

What did you expect to see?

  • When a changefeed is paused, its pending initial scan tasks should be cancelled; otherwise they might block subsequent changefeed initialization.

What did you see instead?

After step 1, one changefeed finished initialization (all tables were added to the changefeed).
After step 3, the table count of the initialized changefeed stayed at 0 for more than 1.4 hours, until it was paused.
After step 4, the number of pending initial scan tasks decreased, but very slowly.
After step 6, the pending tasks disappeared.


Versions of the cluster

v6.5.9

@fubinzh added the area/ticdc and type/bug labels on Oct 9, 2024
@fubinzh (Author) commented Oct 9, 2024

Encountered a similar issue when a changefeed failed.

The changefeed failed at 10/03 00:00 during the initial scan, but the initial scan continued (and the sorter's on-disk data actually increased) until I deleted all changefeeds at 10/03 09:00.

@fubinzh (Author) commented Oct 9, 2024

/severity major

@3AceShowHand (Contributor):

Pausing or removing a changefeed closes the kv client; this should also close the gRPC connections to TiKV-CDC and cancel the pending connections.
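
For reference, here is a minimal sketch of that expected behaviour (the names and the stream interface are assumptions for illustration, not TiCDC's actual kv client code): pausing or removing the changefeed cancels the kv client's context, which should tear down the stream to TiKV-CDC so the server side can drop its pending scan tasks.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// eventStream stands in for the gRPC event stream from TiKV-CDC; the
// concrete type and Recv signature here are illustrative only.
type eventStream interface {
	Recv() (string, error)
}

// fakeStream emits one event per 100ms until its context is cancelled.
type fakeStream struct{ ctx context.Context }

func (s *fakeStream) Recv() (string, error) {
	select {
	case <-s.ctx.Done():
		return "", errors.New("stream closed: " + s.ctx.Err().Error())
	case <-time.After(100 * time.Millisecond):
		return "kv entry", nil
	}
}

func runKVClient(ctx context.Context, stream eventStream) error {
	for {
		ev, err := stream.Recv()
		if err != nil {
			return err // cancellation surfaces here and the client exits
		}
		fmt.Println("received:", ev)
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() { done <- runKVClient(ctx, &fakeStream{ctx: ctx}) }()

	time.Sleep(300 * time.Millisecond)
	cancel() // "pause the changefeed": cancel the kv client's context
	fmt.Println("kv client stopped:", <-done)
}
```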

@3AceShowHand (Contributor):

I am trying to reproduce this scenario in my local test environment. Here are some observations:

  • There are more than 100,000 regions in the TiKV cluster.

On the TiCDC side:

  • Resumed the changefeed at around 22:03:07.265: the kv client connects to TiKV and starts incremental scan tasks.
  • Paused the changefeed at around 22:04:38.215. The KV client received a lot of "region failed" errors, the last one at 22:04:41.591, which means it takes around 3 seconds for the KV client to fully close. There were 14954 errors handled in 3 seconds, around 5000/s.

On the TiKV side:

  • A lot of "cdc register region" logs were found; many regions were registered.
  • The TiKV-CDC incremental scan worker thread count defaults to 4, and the incremental scan concurrency is 6. Each region's incremental scan is blocked on acquiring the concurrency semaphore (see the sketch after the logs below).
  • When the changefeed is paused, TiKV-CDC deregisters the connection and stops observing the regions, which also takes around 3 seconds.
    • Stopping the observation of a region also sets the downstream state to stopped, so the incremental scan can be cancelled.
[2024/10/12 22:03:07.265 +08:00] [INFO] [middleware.go:48] [/api/v2/changefeeds/blackhole-test/resume] [status=200] [method=POST] [path=/api/v2/changefeeds/blackhole-test/resume] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=146.190585ms]

[2024/10/12 22:04:38.215 +08:00] [INFO] [middleware.go:48] [/api/v1/changefeeds/blackhole-test/pause] [status=202] [method=POST] [path=/api/v1/changefeeds/blackhole-test/pause] [query=] [ip=10.2.7.26] [user-agent=Go-http-client/1.1] [client-version=v6.5.9] [] [duration=78.699473ms]

[2024/10/12 22:04:41.591 +08:00] [INFO] [client.go:613] ["region failed"]
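
The semaphore-gated scan behaviour described above can be illustrated with a small Go sketch (TiKV-CDC itself is implemented in Rust; the names here are illustrative, with only the concurrency value taken from the observation above). Most queued regions block waiting for a permit, and once the downstream is marked stopped a queued task bails out right after acquiring its permit instead of running the scan.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"golang.org/x/sync/semaphore"
)

func main() {
	const scanConcurrency = 6 // incremental scan concurrency observed above
	sem := semaphore.NewWeighted(scanConcurrency)
	var stopped atomic.Bool
	var wg sync.WaitGroup

	scanRegion := func(regionID int) {
		defer wg.Done()
		// Every queued region blocks here until a permit becomes available.
		if err := sem.Acquire(context.Background(), 1); err != nil {
			return
		}
		defer sem.Release(1)
		if stopped.Load() {
			// The downstream was stopped (changefeed paused): skip the scan.
			fmt.Printf("region %d: incremental scan cancelled\n", regionID)
			return
		}
		fmt.Printf("region %d: running incremental scan\n", regionID)
		time.Sleep(50 * time.Millisecond) // pretend the scan takes a while
	}

	for i := 0; i < 20; i++ {
		wg.Add(1)
		go scanRegion(i)
	}

	// Simulate pausing the changefeed: deregistering the downstream sets the
	// stop flag, so regions still waiting on the semaphore exit quickly.
	time.Sleep(20 * time.Millisecond)
	stopped.Store(true)
	wg.Wait()
}
```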

@3AceShowHand (Contributor):

  • The changefeed was paused around 09:40.
  • The event feed count did not drop to 0 immediately.
  • The KV client still received a lot of kv entries.

@3AceShowHand (Contributor) commented Oct 16, 2024

[2024/10/02 01:38:47.996 +00:00] [INFO] [feed_state_manager.go:243] ["handle admin job"] [namespace=default] [changefeed=changefeed-1841264655229501440] [job="{\"CfID\":{\"Namespace\":\"default\",\"ID\":\"changefeed-1841264655229501440\"},\"Type\":1,\"Error\":null,\"OverwriteCheckpointTs\":0}"]


[2024/10/02 01:38:47.996 +00:00] [INFO] [ddl_puller.go:692] ["close the ddl puller"] [namespace=default] [changefeed=changefeed-1841264655229501440_ddl_puller]

[2024/10/02 01:38:48.091 +00:00] [INFO] [changefeed.go:723] ["changefeed closed"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=181ns]

[2024/10/02 01:38:48.098 +00:00] [INFO] [manager.go:147] ["Closed source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=44.248µs]

From the logs above, we know that both the changefeed and the processor were closed.

But there are still logs indicating that the kv client keeps fetching data: a region reports an error, and then the client reconnects to that region.

[2024/10/02 01:38:48.099 +00:00] [INFO] [client.go:613] ["region failed"] [span="[748000fffffffffffffe5f728000000000ff1f11290000000000fa, 748000fffffffffffffe5f728000000000ff1f1a570000000000fa)"] [regionId=1504] [error="[CDC:ErrPendingRegionCancel]pending region cancelled due to stream disconnecting"]
[2024/10/02 01:38:48.146 +00:00] [INFO] [client.go:1462] ["event feed starts to check locked regions"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=67502] 
...

These logs indicate that the processor was blocked for a long time, which may be the root cause of the issue.


[2024/10/01 23:59:09.085 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.00036425s] [role=processor]

[2024/10/02 01:38:48.093 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=1h39m44.999948235s] [role=processor]
[2024/10/02 01:38:48.194 +00:00] [INFO] [puller.go:293] ["table puller has been stucked"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=82046] [tableName=`db_218`.`dh_pay_customer_0`] [resolvedTs=452818737561600000] [slowestRangeTs=452818737561600000] [range="[7480000000000140ff7e5f72ffffffffffffffffff0000000000fb, 7480000000000140ff7e5f730000000000fa)"]

[2024/10/02 00:00:58.937 +00:00] [INFO] [client.go:1485] ["event feed finds a uninitialized slow region"] [namespace=default] [changefeed=changefeed-1841264655229501440] [tableID=22030] [tableName=`db_134`.`dh_user_withdraw_order_0`] [slowRegion="{\"RegionID\":2176379,\"CheckpointTs\":452818737561600000,\"Initialized\":false,\"Created\":\"2024-10-01T23:54:58.945274175Z\"}"]

@3AceShowHand (Contributor) commented Oct 16, 2024

[2024/10/01 23:58:50.650 +00:00] [INFO] [processor.go:1242] ["Processor try to close source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
[2024/10/01 23:58:50.650 +00:00] [INFO] [manager.go:132] ["Closing source manager"] [namespace=default] [changefeed=changefeed-1841264655229501440]
...
[2024/10/02 00:20:13.166 +00:00] [INFO] [manager.go:140] ["All pullers have been closed"] [namespace=default] [changefeed=changefeed-1841264655229501440] [cost=21m22.516489165s]

There is one log indicating that closing the pullers took more than 21 minutes. Another processor does not have such a log and was blocked for more than 2.5 hours.

[2024/10/02 02:32:39.092 +00:00] [WARN] [client.go:271] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=2h33m35.999487376s] [role=processor]

@3AceShowHand (Contributor) commented Oct 16, 2024

The root cause of the issue is that the kv client cannot unregister its event handler from the worker pool immediately.

When a puller is closed, its kv client is closed as well; the kv client has to unregister its event handler from the worker pool first, and this takes too much time, so each puller cannot be closed immediately. When there are a lot of tables, each table has its own puller, so some pullers that have not yet been closed keep working normally (see the sketch after the log below).

This issue should not exist after v7.5, since the kv client was refactored and the workerpool is no longer used.

[2024/10/02 02:04:51.764 +00:00] [WARN] [pool_impl.go:444] ["synchronize is taking too long, report a bug"] [duration=10.000434501s]
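
A minimal sketch of the shutdown pattern described above (hypothetical types; this is not tiflow's actual workerpool API): every table puller's kv client must synchronously unregister its handler from a shared worker pool before the puller counts as closed, so with tens of thousands of pullers the blocking unregister calls serialize and the total close time grows with the table count.

```go
package main

import (
	"fmt"
	"time"
)

// workerPool stands in for the shared event-handling pool.
type workerPool struct{}

// unregister blocks until the pool has synchronized with its workers; the
// delay here is simulated, and in the real issue it can be far longer under load.
func (p *workerPool) unregister(handlerID int) {
	time.Sleep(10 * time.Millisecond)
}

// puller models one table puller whose kv client owns a handler in the pool.
type puller struct {
	id   int
	pool *workerPool
}

func (p *puller) close() {
	// The kv client is not considered closed until its handler is gone.
	p.pool.unregister(p.id)
}

func main() {
	pool := &workerPool{}
	pullers := make([]*puller, 200) // stands in for ~40k table pullers
	for i := range pullers {
		pullers[i] = &puller{id: i, pool: pool}
	}

	start := time.Now()
	for _, p := range pullers {
		p.close() // sequential blocking unregister, as observed in the logs
	}
	fmt.Printf("closed %d pullers in %v\n", len(pullers), time.Since(start))
}
```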

@wk989898 (Collaborator):

Quoting the comment above: "This issue should not exist after v7.5, since the kv client was refactored, workerpool is no more used."

The workerpool can still be found in v7.5:

tiflow/cdc/server/server.go

Lines 350 to 352 in d2c35e4

eg.Go(func() error {
	return kv.RunWorkerPool(egCtx)
})
