
cannot get cluster id after scale-out & scale-in pd #8993

Closed
lidezhu opened this issue Jan 13, 2025 · 12 comments
Labels
type/bug The issue is confirmed as a bug.

Comments

@lidezhu

lidezhu commented Jan 13, 2025

Bug Report

What did you do?

tiup playground nightly --db 1 --kv 1 --pd 1 --ticdc 1 --tiflash 0 --without-monitor

# perform scale-out
tiup playground scale-out --pd 1

# note the PID of the original PD
tiup playground display 

# perform scale-in
tiup playground scale-in --pid 23397
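
Before picking a PID to scale in, it may help to confirm which PD instance is currently the leader. Below is a minimal sketch using pd-ctl through tiup; the port 2379 is assumed to be the client URL of the playground's first PD and may differ in your environment.

# list all PD members and their client URLs (port is an assumption based on the playground default)
tiup ctl:nightly pd -u http://127.0.0.1:2379 member

# show which member is currently the PD leader
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader show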

What did you expect to see?

CDC can still use the PD client to get the cluster ID.

What did you see instead?

CDC cannot use the PD client to get the cluster ID, and there are a lot of errors like the following from the PD client:

[2025/01/13 17:22:47.279 +08:00] [INFO] [pd_service_discovery.go:913] ["[pd] cannot update member from this url"] [url=http://127.0.0.1:56335] [error="[PD:client:ErrClientGetLeader]get leader failed, leader url doesn't exist"]
[2025/01/13 17:22:47.681 +08:00] [INFO] [pd_service_discovery.go:913] ["[pd] cannot update member from this url"] [url=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 statu
[2025/01/13 17:22:53.279 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://127.0.0.1:2379,http://127.0.0.1:56335]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2025/01/13 17:23:54.344 +08:00] [WARN] [pd_service_discovery.go:837] ["[pd] failed to get cluster id"] [url=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE"]

What version of PD are you using (pd-server -V)?

nightly
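
As a side note, whether a surviving PD still reports a cluster ID can be checked directly with pd-ctl; a sketch, assuming http://127.0.0.1:56335 (taken from the log above) is the scaled-out PD's client URL:

# print cluster information, including the cluster id, from the surviving PD
tiup ctl:nightly pd -u http://127.0.0.1:56335 cluster

# check the health of each PD member
tiup ctl:nightly pd -u http://127.0.0.1:56335 health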

@kennytm

kennytm commented Jan 13, 2025

Is PID 23397 the original leader?

And perhaps you should extract a minimal reproduction from cdc that demonstrates the issue.

@rleungx
Member

rleungx commented Jan 14, 2025

@lidezhu Is this issue related to the Note we mentioned in https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup/#scale-in-a-tidbpdtikv-cluster?

@lidezhu
Author

lidezhu commented Jan 14, 2025

@rleungx Yes, it seems to be the root cause of the issue; let me try it.

@lidezhu
Author

lidezhu commented Jan 14, 2025

pdlogs.tar.gz

@lidezhu
Author

lidezhu commented Jan 14, 2025

  1. deploy a cluster with 3 PDs;
  2. scale out 3 new PDs;
  3. transfer the PD leader manually to one of the 3 new PDs (see the sketch at the end of this comment);
  4. stop the 3 old PDs; after that, the 3 new PDs go down as well.

Logs are uploaded, PTAL @rleungx @okJiang
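
For reference, the manual leader transfer in step 3 can be done with pd-ctl, roughly as follows; the member name pd-4 is a hypothetical placeholder for one of the newly added PDs.

# transfer the PD leader to one of the scaled-out members (member name is hypothetical)
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader transfer pd-4

# verify that the leader has moved
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader show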

@okJiang
Member

okJiang commented Jan 14, 2025

@lidezhu

The PD Client in TiKV caches the list of PD nodes. The current version of TiKV has a mechanism to automatically and regularly update PD nodes, which can help mitigate the issue of an expired list of PD nodes cached by TiKV. However, after scaling out PD, you should try to avoid directly removing all PD nodes at once that exist before the scaling. If necessary, before making all the previously existing PD nodes offline, make sure to switch the PD leader to a newly added PD node.

And do you need to reload your cluster after scaling?

This should be a translation issue.

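The note quoted above amounts to an order of operations for a real (non-playground) cluster. A rough sketch with tiup cluster and pd-ctl; the cluster name, topology file, member name, and addresses are all hypothetical placeholders.

# 1. scale out the new PD nodes first (topology file is hypothetical)
tiup cluster scale-out mycluster scale-out-pd.yaml

# 2. transfer the PD leader to one of the new PD members before touching the old ones
#    (use the ctl version matching your cluster instead of nightly)
tiup ctl:nightly pd -u http://10.0.1.10:2379 member leader transfer pd-new-1

# 3. only then scale in the old PD nodes, e.g. one at a time
tiup cluster scale-in mycluster --node 10.0.1.1:2379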

@lidezhu
Author

lidezhu commented Jan 14, 2025

Why is it necessary to reload the cluster after scaling out the PD?
The issue here is that stopping the 3 old PD nodes will cause the 3 new PD nodes to go down as well.
Can the reload operation help prevent this situation? @okJiang

@kennytm

kennytm commented Jan 14, 2025

For context, in our real-world customer scenario, they performed the following operations in order:

  1. scale-out 3 PDs
  2. transfer PD leader
  3. scale-in 3 PDs
  4. (noticed that changefeed can't be created)
  5. reload --skip-restart
  6. (still can't create changefeed)

The customer DOES NOT want to perform a full reload of every component, as this introduces non-negligible downtime.

(The changefeed issue is fixed by restarting all TiCDC components)
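
Regarding that last point, restarting only the TiCDC nodes rather than reloading every component can be scoped by role, which limits the downtime to CDC. A sketch, assuming a cluster named mycluster:

# restart only the cdc components so they reconnect to the current PD members (cluster name is hypothetical)
tiup cluster restart mycluster -R cdc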

@lidezhu
Author

lidezhu commented Jan 14, 2025

@kennytm Did the customer perform the "transfer PD leader" operation before scaling in the old PDs? It seems this info wasn't provided in the original issue.

@rleungx
Member

rleungx commented Jan 15, 2025

  1. deploy a cluster with 3 PDs;
  2. scale out 3 new PDs;
  3. transfer the PD leader manually to one of the 3 new PDs;
  4. stop the 3 old PDs; after that, the 3 new PDs go down as well.

It seems you didn't scale in the old PDs but only stopped them? In that case PD loses quorum: with 6 members, at least 4 must be up, and stopping 3 leaves only 3.

@lidezhu
Author

lidezhu commented Jan 15, 2025

Got it. I used stop instead of scale-in to preserve the PD logs. I will try again with scale-in.

@lidezhu
Author

lidezhu commented Jan 15, 2025

It works fine when the leader is transferred before scaling in the old PDs. Thanks for the support.

@lidezhu lidezhu closed this as completed Jan 15, 2025