
cannot get cluster id after scale-out & scale-in pd #8993

Closed
lidezhu opened this issue Jan 13, 2025 · 12 comments
Labels
type/bug The issue is confirmed as a bug.

Comments

@lidezhu

lidezhu commented Jan 13, 2025

Bug Report

What did you do?

tiup playground nightly --db 1 --kv 1 --pd 1 --ticdc 1 --tiflash 0 --without-monitor

# perform scale-out
tiup playground scale-out --pd 1

# note the PID of the original PD
tiup playground display 

# perform scale-in
tiup playground scale-in --pid 23397
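
Before picking a PID to scale in, it may help to confirm which PD instance is currently the leader. Below is a minimal sketch using pd-ctl through tiup; the port 2379 is assumed to be the client URL of the playground's first PD and may differ in your environment.

# list all PD members and their client URLs (port is an assumption based on the playground default)
tiup ctl:nightly pd -u http://127.0.0.1:2379 member

# show which member is currently the PD leader
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader show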

What did you expect to see?

CDC can still use the PD client to get the cluster ID.

What did you see instead?

CDC cannot use the PD client to get the cluster ID, and there are a lot of errors like the following from the PD client:

[2025/01/13 17:22:47.279 +08:00] [INFO] [pd_service_discovery.go:913] ["[pd] cannot update member from this url"] [url=http://127.0.0.1:56335] [error="[PD:client:ErrClientGetLeader]get leader failed, leader url doesn't exist"]
[2025/01/13 17:22:47.681 +08:00] [INFO] [pd_service_discovery.go:913] ["[pd] cannot update member from this url"] [url=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 statu
[2025/01/13 17:22:53.279 +08:00] [ERROR] [pd_service_discovery.go:560] ["[pd] failed to update member"] [urls="[http://127.0.0.1:2379,http://127.0.0.1:56335]"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2025/01/13 17:23:54.344 +08:00] [WARN] [pd_service_discovery.go:837] ["[pd] failed to get cluster id"] [url=http://127.0.0.1:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE: error:rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\" target:127.0.0.1:2379 status:TRANSIENT_FAILURE"]

What version of PD are you using (pd-server -V)?

nightly
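
As a side note, whether a surviving PD still reports a cluster ID can be checked directly with pd-ctl; a sketch, assuming http://127.0.0.1:56335 (taken from the log above) is the scaled-out PD's client URL:

# print cluster information, including the cluster id, from the surviving PD
tiup ctl:nightly pd -u http://127.0.0.1:56335 cluster

# check the health of each PD member
tiup ctl:nightly pd -u http://127.0.0.1:56335 health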

@kennytm

kennytm commented Jan 13, 2025

Is PID 23397 the original leader?

And perhaps you should extract a minimal reproduction from cdc that demonstrates the issue.

@rleungx
Member

rleungx commented Jan 14, 2025

@lidezhu Is this issue related to the Note we mentioned in https://docs.pingcap.com/tidb/stable/scale-tidb-using-tiup/#scale-in-a-tidbpdtikv-cluster?

@lidezhu
Author

lidezhu commented Jan 14, 2025

@rleungx Yes, it seems to be the root cause of the issue; let me try it.

@lidezhu
Author

lidezhu commented Jan 14, 2025

pdlogs.tar.gz

@lidezhu
Author

lidezhu commented Jan 14, 2025

  1. deploy a cluster with 3 PDs;
  2. scale out 3 new PDs;
  3. transfer the PD leader manually to one of the 3 new PDs (see the sketch at the end of this comment);
  4. stop the 3 old PDs; after that, the 3 new PDs go down as well.

Logs are uploaded, PTAL @rleungx @okJiang
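
For reference, the manual leader transfer in step 3 can be done with pd-ctl, roughly as follows; the member name pd-4 is a hypothetical placeholder for one of the newly added PDs.

# transfer the PD leader to one of the scaled-out members (member name is hypothetical)
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader transfer pd-4

# verify that the leader has moved
tiup ctl:nightly pd -u http://127.0.0.1:2379 member leader show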

@okJiang
Member

okJiang commented Jan 14, 2025

@lidezhu

The PD Client in TiKV caches the list of PD nodes. The current version of TiKV has a mechanism to automatically and regularly update PD nodes, which can help mitigate the issue of an expired list of PD nodes cached by TiKV. However, after scaling out PD, you should try to avoid directly removing all PD nodes at once that exist before the scaling. If necessary, before making all the previously existing PD nodes offline, make sure to switch the PD leader to a newly added PD node.

And do you need to reload your cluster after scaling?

This should be a translation issue.

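The note quoted above amounts to an order of operations for a real (non-playground) cluster. A rough sketch with tiup cluster and pd-ctl; the cluster name, topology file, member name, and addresses are all hypothetical placeholders.

# 1. scale out the new PD nodes first (topology file is hypothetical)
tiup cluster scale-out mycluster scale-out-pd.yaml

# 2. transfer the PD leader to one of the new PD members before touching the old ones
#    (use the ctl version matching your cluster instead of nightly)
tiup ctl:nightly pd -u http://10.0.1.10:2379 member leader transfer pd-new-1

# 3. only then scale in the old PD nodes, e.g. one at a time
tiup cluster scale-in mycluster --node 10.0.1.1:2379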

@lidezhu
Author

lidezhu commented Jan 14, 2025

Why is it necessary to reload the cluster after scaling out the PD?
The issue here is that stopping the 3 old PD nodes will cause the 3 new PD nodes to go down as well.
Can the reload operation help prevent this situation? @okJiang

@kennytm

kennytm commented Jan 14, 2025

For context, in our real-world customer scenario, they performed the following operations in order:

  1. scale-out 3 PDs
  2. transfer PD leader
  3. scale-in 3 PDs
  4. (noticed that changefeed can't be created)
  5. reload --skip-restart
  6. (still can't create changefeed)

The customer DOES NOT want to perform a full reload of every component, as this introduces non-negligible downtime.

(The changefeed issue is fixed by restarting all TiCDC components)
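
Regarding that last point, restarting only the TiCDC nodes rather than reloading every component can be scoped by role, which limits the downtime to CDC. A sketch, assuming a cluster named mycluster:

# restart only the cdc components so they reconnect to the current PD members (cluster name is hypothetical)
tiup cluster restart mycluster -R cdc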

@lidezhu
Author

lidezhu commented Jan 14, 2025

@kennytm Did the customer perform the "transfer PD leader" operation before scaling in the old PDs? It seems this info wasn't provided in the original issue.

@rleungx
Member

rleungx commented Jan 15, 2025

  1. deploy a cluster with 3 PDs;
  2. scale out 3 new PDs;
  3. transfer the PD leader manually to one of the 3 new PDs;
  4. stop the 3 old PDs; after that, the 3 new PDs go down as well.

It seems you didn't scale in the old PDs but only stopped them? In that case PD loses quorum: with 6 members, at least 4 must be up, and stopping 3 leaves only 3.

@lidezhu
Author

lidezhu commented Jan 15, 2025

Got it. I used stop instead of scale-in to preserve the PD logs. I will try again with scale-in.

@lidezhu
Author

lidezhu commented Jan 15, 2025

It works fine when the leader is transferred before scaling in the old PDs. Thanks for the support.

@lidezhu lidezhu closed this as completed Jan 15, 2025