Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emphasis the importance of input of unsafe recovery #19595

Merged
merged 5 commits into from
Dec 6, 2024
Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 14 additions & 5 deletions online-unsafe-recovery.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,19 @@ Before using Online Unsafe Recovery, make sure that the following requirements a

### Step 1. Specify the stores that cannot be recovered

To trigger automatic recovery, use PD Control to execute [`unsafe remove-failed-stores <store_id>[,<store_id>,...]`](/pd-control.md#unsafe-remove-failed-stores-store-ids--show) and specify **all** the TiKV nodes that cannot be recovered, separated by commas.

{{< copyable "shell-regular" >}}
To trigger automatic recovery, use PD Control to execute [`unsafe remove-failed-stores <store_id>[,<store_id>,...]`](/pd-control.md#unsafe-remove-failed-stores-store-ids--show) and specify **all** the TiKV and TiFlash nodes that cannot be recovered, separated by commas.

```bash
pd-ctl -u <pd_addr> unsafe remove-failed-stores <store_id1,store_id2,...>
```

> **Note:**
>
> - Make sure that **all** unrecoverable TiKV and TiFlash nodes are specified in the preceding command at once. Omitting any unrecoverable nodes might cause the recovery process to be blocked.
> - If you have already performed Online Unsafe Recovery within a short period (such as within a day), make sure that the subsequent execution of this command still include the previously processed TiKV and TiFlash nodes.
qiancai marked this conversation as resolved.
Show resolved Hide resolved

To specify the longest allowable duration of a recovery task, use the `--timeout <seconds>` option. If this option is not specified, the longest duration is 5 minutes by default. When the timeout occurs, the recovery is interrupted and returns an error.

If the command returns `Success`, PD Control has successfully registered the task to PD. This only means that the request has been accepted, not that the recovery has been successfully performed. The recovery task is performed in the background. To see the recovery progress, use [`show`](#step-2-check-the-recovery-progress-and-wait-for-the-completion).

If the command returns `Failed`, PD Control has failed to register the task to PD. The possible errors are as follows:
Expand All @@ -54,11 +59,15 @@ If the command returns `Failed`, PD Control has failed to register the task to P
- `invalid input store x doesn't exist`: The specified store ID does not exist.
- `invalid input store x is up and connected`: The specified store with the ID is still healthy and should not be recovered.

To specify the longest allowable duration of a recovery task, use the `--timeout <seconds>` option. If this option is not specified, the longest duration is 5 minutes by default. When the timeout occurs, the recovery is interrupted and returns an error.
If PD losses store information for unrecoverable TiKV nodes after disaster recovery operations such as [`pd-recover`](/pd-recover.md), which makes the specific store IDs unknown, you can use the `--auto-detect` mode. This mode enables PD to automatically remove replicas from TiKV nodes that are either unregistered or previously registered but forcibly deleted.
qiancai marked this conversation as resolved.
Show resolved Hide resolved

```bash
pd-ctl -u <pd_addr> unsafe remove-failed-stores --auto-detect
```

> **Note:**
>
> - Because this command needs to collect information from all peers, it might cause an increase in memory usage (100,000 peers are estimated to use 500 MiB of memory).
> - Because unsafe recovery needs to collect information from all peers, it might cause an increase in memory usage (100,000 peers are estimated to use 500 MiB of memory).
> - If PD restarts when the command is running, the recovery is interrupted and you need to trigger the task again.
> - Once the command is running, the specified stores will be set to the Tombstone status, and you cannot restart these stores.
> - When the command is running, all scheduling tasks and split/merge are paused and will be resumed automatically after the recovery is successful or fails.
Expand Down
Loading