
Backup(BR) is getting failed for huge table count #58513

Open
kabileshKabi opened this issue Dec 24, 2024 · 15 comments
Labels
component/br This issue is related to BR of TiDB. type/bug The issue is confirmed as a bug.

Comments

@kabileshKabi

Bug Report

Please answer these questions before submitting your issue. Thanks!

### 1. Minimal reproduce step (Required)

Issue description:
Our application is a SaaS-based multi-tenancy application in which each tenant has its own database; we have more than 14k databases and more than 600k tables.

We have a strict backup requirement, but when we run a BR full backup it does not show any progress and eventually fails with an RPC error.

Command and log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec_232024 --log-file backupdec232024.log
Detail BR log in backupdec232024.log

Full Backup <..................................................................................................................................................................> 0.00%
```


I am also attaching the detailed logs as well.


### 2. What did you expect to see? (Required)

We expect the backup to complete successfully and produce a consistent backup.

### 3. What did you see instead (Required)

We see backup failures; the BR backup does not complete.

### 4. What is your TiDB version? (Required)

```
mysql> select tidb_version();
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v8.1.1
Edition: Community
Git Commit Hash: a7df4f9845d5d6a590c5d45dad0dcc9f21aa8765
Git Branch: HEAD
UTC Build Time: 2024-08-22 05:49:03
GoVersion: go1.21.13
Race Enabled: false
Check Table Before Drop: false
Store: tikv |
```

[backup_log.txt](https://github.com/user-attachments/files/18239868/backup_log.txt)



@kabileshKabi kabileshKabi added the type/bug The issue is confirmed as a bug. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

@kabileshKabi Thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

@BornChanger BornChanger added the component/br This issue is related to BR of TiDB. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

Also, since your cluster has 600k tables, the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

@kabileshKabi
Author

> @kabileshKabi Thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

BR is running on a VM node. The backup is not being killed; it just exits.

@kabileshKabi
Author

> Also, since your cluster has 600k tables, the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

This seems to be the issue: it needs to load the schema and is hitting an RPC timeout.

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

I am sharing the log again.

Command execution log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec252024 --log-file backupdec252024.log
Detail BR log in backupdec252024.log


Full Backup <...................................................................................................................................................................> 0.00%

[ec2-user@ip-10-1
```

The above command is run from the tiup node and simply exits as shown above, without printing any error; we only get INFO lines like the ones below. While the command is running, we also notice high memory and CPU usage on the tiup node.

```
[2024/12/25 04:53:54.515 +00:00] [INFO] [backup.go:491] ["current backup safePoint job"] [safePoint="{ID=br-cfab8f42-1743-412b-b509-4e8661991690,TTL=1h12m0s,BackupTime=\"2024-12-25 04:53:54.48 +0000 UTC\",BackupTS=454846692584325250}"]
[2024/12/25 04:53:54.563 +00:00] [INFO] [manager.go:261] ["break campaign loop, NewSession failed"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:178\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb/pkg/util.contextDone\n\t/workspace/source/tidb/pkg/util/etcd.go:90\ngithub.com/pingcap/tidb/pkg/util.NewSession\n\t/workspace/source/tidb/pkg/util/etcd.go:50\ngithub.com/pingcap/tidb/pkg/owner.(*ownerManager).campaignLoop\n\t/workspace/source/tidb/pkg/owner/manager.go:259\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"]
[2024/12/25 04:53:55.840 +00:00] [INFO] [manager.go:317] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2024/12/25 04:54:57.889 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=test]
[2024/12/25 04:55:19.481 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=testplay]
[2024/12/25 04:55:51.300 +00:00] [INFO] [backup.go:578] ["get placement policies"] [count=0]
```

I am also attaching the full logs here.

Thank you.

[backup_log_new.txt](https://github.com/user-attachments/files/18243431/backup_log_new.txt)


@BornChanger
Contributor

BornChanger commented Dec 25, 2024

@kabileshKabi If the backup log really ends there, I think it's likely the BR node (the tiup host) was OOM-killed. Can you double-check? If that is the reason, please use a higher-end VM (say, 64 GiB of memory) to run the backup.

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

@BornChanger Is there any way to limit the memory usage of BR?

I can see it being killed by the OOM killer on the tiup node:

```
[917931.146960] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-9441.scope,task=br,pid=3006955,uid=1000
[917931.148603] Out of memory: Killed process 3006955 (br) total-vm:14123036kB, anon-rss:11402632kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:22896kB oom_score_adj:0
```
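As an aside, a generic way to keep the `br` process under control is OS/Go-runtime tooling rather than a BR flag: a soft heap cap via `GOMEMLIMIT` (honored by the go1.21-built v8.1.1 binary) plus a hard cgroup cap via `systemd-run`. This is only a sketch; the `24GiB`/`28G` values below are assumptions to adjust for the actual VM size.

```
# Sketch only: GOMEMLIMIT is a generic Go runtime soft memory cap, not a BR
# option; systemd-run -p MemoryMax adds a hard cgroup limit so a runaway br
# process cannot exhaust the whole host. Values are placeholders.
GOMEMLIMIT=24GiB \
systemd-run --user --scope -p MemoryMax=28G \
  /home/ec2-user/.tiup/components/br/v8.1.1/br backup full \
    --pd 10.1.3.190:2379 \
    --storage s3://us-chat-db-tidb-backup/test_backupdec252024 \
    --log-file backupdec252024.log
```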

@kabileshKabi
Author

@BornChanger
We increased the memory of the tiup node to 32 GB and initiated the backup. It was running fine for a while, but then failed as below.

Logs:

```
[2024/12/25 18:22:29.182 +00:00] [INFO] [client.go:842] ["Backup Ranges Completed"] [take=7h28m40.300275186s]
[2024/12/25 18:22:29.182 +00:00] [INFO] [backup.go:674] ["wait for flush checkpoint..."]
[2024/12/25 18:22:29.182 +00:00] [INFO] [checkpoint.go:455] ["stop checkpoint runner"]
[2024/12/25 18:22:29.182 +00:00] [INFO] [checkpoint.go:342] ["stop checkpoint flush worker"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [backup.go:497] ["skip removing gc-safepoint keeper for next retry"] [gc-id=br-3d999896-00a0-4f88-b43a-61662eb7066c]
[2024/12/25 18:22:29.217 +00:00] [INFO] [pd_service_discovery.go:550] ["[pd] exit member loop due to context canceled"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:423] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:348] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/12/25 18:22:29.217 +00:00] [INFO] [resource_manager_client.go:290] ["[resource manager] exit resource token dispatcher"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:124] ["closing tso client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:182] ["exit tso requests cancel loop"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:236] ["exit tso dispatcher loop"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:129] ["close tso client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:139] ["tso client is closed"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [pd_service_discovery.go:637] ["[pd] close pd service discovery client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [client.go:347] ["[pd] http client closed"] [source=tikv-driver]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:182] ["exit tso requests cancel loop"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:236] ["exit tso dispatcher loop"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [pd_service_discovery.go:550] ["[pd] exit member loop due to context canceled"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:423] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:348] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.218 +00:00] [INFO] [resource_manager_client.go:290] ["[resource manager] exit resource token dispatcher"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:124] ["closing tso client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:129] ["close tso client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:139] ["tso client is closed"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [pd_service_discovery.go:637] ["[pd] close pd service discovery client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [client.go:347] ["[pd] http client closed"] [source="br/lightning PD controller"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [metafile.go:725] ["exit write metas by context done"]
[2024/12/25 18:22:29.413 +00:00] [INFO] [collector.go:224] ["units canceled"] [cancel-unit=0]
2024-12-25 18:23:02.678883393 +0000 UTC m=+27134.120592011 write error: write length 982132449 exceeds maximum file size 314572800
[2024/12/25 18:23:11.121 +00:00] [ERROR] [backup.go:57] ["failed to backup"] [error="at fine-grained backup, remained ranges = 0: backoff exceeds the max backoff time 1h0m0s: [BR:Common:ErrUnknown]internal error"] [errorVerbose="[BR:Common:ErrUnknown]internal error\nbackoff exceeds the max backoff time 1h0m0s\ngithub.com/pingcap/tidb/br/pkg/utils.(*RetryWithBackoffer).BackOff\n\t/workspace/source/tidb/br/pkg/utils/retry.go:348\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup\n\t/workspace/source/tidb/br/pkg/backup/client.go:1210\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/workspace/source/tidb/br/pkg/backup/client.go:958\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/workspace/source/tidb/br/pkg/backup/client.go:864\ngithub.com/pingcap/tidb/pkg/util.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/workspace/source/tidb/pkg/util/worker_pool.go:81\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197\nat fine-grained backup, remained ranges = 0"] [stack="main.runBackupCommand\n\t/workspace/source/tidb/br/cmd/br/backup.go:57\nmain.newFullBackupCommand.func1\n\t/workspace/source/tidb/br/cmd/br/backup.go:148\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]
[2024/12/25 18:23:11.122 +00:00] [ERROR] [main.go:38] ["br failed"] [error="at fine-grained backup, remained ranges = 0: backoff exceeds the max backoff time 1h0m0s: [BR:Common:ErrUnknown]internal error"] [errorVerbose="[BR:Common:ErrUnknown]internal error\nbackoff exceeds the max backoff time 1h0m0s\ngithub.com/pingcap/tidb/br/pkg/utils.(*RetryWithBackoffer).BackOff\n\t/workspace/source/tidb/br/pkg/utils/retry.go:348\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup\n\t/workspace/source/tidb/br/pkg/backup/client.go:1210\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/workspace/source/tidb/br/pkg/backup/client.go:958\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/workspace/source/tidb/br/pkg/backup/client.go:864\ngithub.com/pingcap/tidb/pkg/util.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/workspace/source/tidb/pkg/util/worker_pool.go:81\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197\nat fine-grained backup, remained ranges = 0"] [stack="main.main\n\t/workspace/source/tidb/br/cmd/br/main.go:38\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]
```

@BornChanger
Contributor

BornChanger commented Dec 26, 2024

@kabileshKabi It seems the backup was almost done. Please try rerunning the same backup command to restart the job; it will take advantage of the checkpoint. Let's see if that works.
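For reference, a sketch of the rerun, using the same command posted earlier in this thread. Assuming checkpoint backup behaves as in recent BR releases, the key point is to reuse the identical `--storage` target so BR can pick up its checkpoint data rather than starting from scratch.

```
# Rerun with the same arguments and, most importantly, the same --storage URI;
# BR should detect the checkpoint data left by the failed run and resume from it.
/home/ec2-user/.tiup/components/br/v8.1.1/br backup full \
  --pd 10.1.3.190:2379 \
  --storage s3://us-chat-db-tidb-backup/test_backupdec252024 \
  --log-file backupdec252024.log
```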

@3pointer
Contributor

From the provided error information:

During the backup phase, specific ranges failed to back up, and the process could not complete even after more than an hour of retries, resulting in a failure.

This type of error is relatively rare. Could you provide more complete backup logs to help us analyze the issue?

We believe the logs from the last hour will contain a large number of `start fine grained backup` entries, indicating that BR was repeatedly attempting to back up but was unable to complete the process.

We need to review the full request logs to identify which specific range caused the issue.

@kabileshKabi
Author

kabileshKabi commented Dec 26, 2024

I have re-initiated the same command, @BornChanger, and I can see it progressing. We have set the GC lifetime to 48 hours.
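For reference, the GC lifetime change mentioned above uses the standard system variable. A minimal sketch, where 48h matches the value described here and 10m0s is the default to restore afterwards:

```
-- Keep the backup snapshot (BackupTS) readable for the whole backup run.
SET GLOBAL tidb_gc_life_time = '48h';
-- After the backup completes, restore the default value.
SET GLOBAL tidb_gc_life_time = '10m0s';
```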

@3pointer I am attaching the collected logs: debug_br.log

@BornChanger
Contributor

@kabileshKabi In the new log there are screen snapshots of the shell command. Can you please send us the log of the failed backup instead?

@kabileshKabi
Author

@BornChanger The log above is indeed from the failed backup.

@BornChanger
Contributor

Please send us the complete backup log if possible.

@kabileshKabi
Author

@BornChanger One more update: the backup that we restarted has now completed.
