
Backup(BR) is getting failed for huge table count #58513

Open
kabileshKabi opened this issue Dec 24, 2024 · 15 comments
Labels
component/br This issue is related to BR of TiDB. type/bug The issue is confirmed as a bug.

Comments

@kabileshKabi

Bug Report

Please answer these questions before submitting your issue. Thanks!

### 1. Minimal reproduce step (Required)

Issue description:
Our application is a SaaS-based multi-tenancy application in which each tenant has its own database; we have more than 14k databases and more than 600k tables.

We have a strict backup requirement, but when we run a BR full backup it does not show any progress and eventually fails with an RPC error.

Command and log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec_232024 --log-file backupdec232024.log
Detail BR log in backupdec232024.log

Full Backup <..................................................................................................................................................................> 0.00%
```


I am also attaching the detailed logs as well.


### 2. What did you expect to see? (Required)

We expect the backup to complete successfully and produce a consistent backup.

### 3. What did you see instead (Required)

We see backup failures; the BR backup does not complete.

### 4. What is your TiDB version? (Required)

```
mysql> select tidb_version();
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version()                                                                                                                                                                                                                                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v8.1.1
Edition: Community
Git Commit Hash: a7df4f9845d5d6a590c5d45dad0dcc9f21aa8765
Git Branch: HEAD
UTC Build Time: 2024-08-22 05:49:03
GoVersion: go1.21.13
Race Enabled: false
Check Table Before Drop: false
Store: tikv |
```

[backup_log.txt](https://github.com/user-attachments/files/18239868/backup_log.txt)



@kabileshKabi kabileshKabi added the type/bug The issue is confirmed as a bug. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

@kabileshKabi Thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

@BornChanger BornChanger added the component/br This issue is related to BR of TiDB. label Dec 24, 2024
@BornChanger
Contributor

BornChanger commented Dec 24, 2024

Also, since your cluster has 600k tables, the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

@kabileshKabi
Author

> @kabileshKabi Thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?

BR is running on a VM node. The backup is not being killed; it just exits.

@kabileshKabi
Author

> Also, since your cluster has 600k tables, the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi

This seems to be the issue: it needs to load the schema and is hitting an RPC timeout.

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

I am sharing the log again.

Command execution log:

```
Starting component br: /home/ec2-user/.tiup/components/br/v8.1.1/br backup full --pd 10.1.3.190:2379 --storage s3://us-chat-db-tidb-backup/test_backupdec252024 --log-file backupdec252024.log
Detail BR log in backupdec252024.log


Full Backup <...................................................................................................................................................................> 0.00%

[ec2-user@ip-10-1
```

The above command is run from the tiup node and simply exits as shown above, without printing any error; we only get INFO lines like the ones below. While the command is running, we also notice high memory and CPU usage on the tiup node.

```
[2024/12/25 04:53:54.515 +00:00] [INFO] [backup.go:491] ["current backup safePoint job"] [safePoint="{ID=br-cfab8f42-1743-412b-b509-4e8661991690,TTL=1h12m0s,BackupTime=\"2024-12-25 04:53:54.48 +0000 UTC\",BackupTS=454846692584325250}"]
[2024/12/25 04:53:54.563 +00:00] [INFO] [manager.go:261] ["break campaign loop, NewSession failed"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="context canceled"] [errorVerbose="context canceled\ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:178\ngithub.com/pingcap/errors.Trace\n\t/root/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb/pkg/util.contextDone\n\t/workspace/source/tidb/pkg/util/etcd.go:90\ngithub.com/pingcap/tidb/pkg/util.NewSession\n\t/workspace/source/tidb/pkg/util/etcd.go:50\ngithub.com/pingcap/tidb/pkg/owner.(*ownerManager).campaignLoop\n\t/workspace/source/tidb/pkg/owner/manager.go:259\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197"]
[2024/12/25 04:53:55.840 +00:00] [INFO] [manager.go:317] ["revoke session"] ["owner info"="[log-backup] /tidb/br-stream/owner ownerManager 72161622-c91b-4405-857a-a5553d977546"] [error="rpc error: code = Canceled desc = grpc: the client connection is closing"]
[2024/12/25 04:54:57.889 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=test]
[2024/12/25 04:55:19.481 +00:00] [INFO] [client.go:531] ["backup empty database"] [db=testplay]
[2024/12/25 04:55:51.300 +00:00] [INFO] [backup.go:578] ["get placement policies"] [count=0]
```

I am also attaching the full logs here.

Thank you.

[backup_log_new.txt](https://github.com/user-attachments/files/18243431/backup_log_new.txt)


@BornChanger
Contributor

BornChanger commented Dec 25, 2024

@kabileshKabi If the backup log really ends there, I think it's likely the BR node (the tiup host) was OOM-killed. Can you double-check? If that is the reason, please use a higher-end VM (say, 64 GiB of memory) to run the backup.

@kabileshKabi
Author

kabileshKabi commented Dec 25, 2024

@BornChanger Is there any way to limit the memory usage of BR?

I can see it being killed by the OOM killer on the tiup node:

```
[917931.146960] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-9441.scope,task=br,pid=3006955,uid=1000
[917931.148603] Out of memory: Killed process 3006955 (br) total-vm:14123036kB, anon-rss:11402632kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:22896kB oom_score_adj:0
```
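As an aside, a generic way to keep the `br` process under control is OS/Go-runtime tooling rather than a BR flag: a soft heap cap via `GOMEMLIMIT` (honored by the go1.21-built v8.1.1 binary) plus a hard cgroup cap via `systemd-run`. This is only a sketch; the `24GiB`/`28G` values below are assumptions to adjust for the actual VM size.

```
# Sketch only: GOMEMLIMIT is a generic Go runtime soft memory cap, not a BR
# option; systemd-run -p MemoryMax adds a hard cgroup limit so a runaway br
# process cannot exhaust the whole host. Values are placeholders.
GOMEMLIMIT=24GiB \
systemd-run --user --scope -p MemoryMax=28G \
  /home/ec2-user/.tiup/components/br/v8.1.1/br backup full \
    --pd 10.1.3.190:2379 \
    --storage s3://us-chat-db-tidb-backup/test_backupdec252024 \
    --log-file backupdec252024.log
```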

@kabileshKabi
Author

@BornChanger
We increased the memory of the tiup node to 32 GB and initiated the backup. It was running fine for a while, but then failed as below.

Logs:

```
[2024/12/25 18:22:29.182 +00:00] [INFO] [client.go:842] ["Backup Ranges Completed"] [take=7h28m40.300275186s]
[2024/12/25 18:22:29.182 +00:00] [INFO] [backup.go:674] ["wait for flush checkpoint..."]
[2024/12/25 18:22:29.182 +00:00] [INFO] [checkpoint.go:455] ["stop checkpoint runner"]
[2024/12/25 18:22:29.182 +00:00] [INFO] [checkpoint.go:342] ["stop checkpoint flush worker"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [backup.go:497] ["skip removing gc-safepoint keeper for next retry"] [gc-id=br-3d999896-00a0-4f88-b43a-61662eb7066c]
[2024/12/25 18:22:29.217 +00:00] [INFO] [pd_service_discovery.go:550] ["[pd] exit member loop due to context canceled"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:423] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:348] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/12/25 18:22:29.217 +00:00] [INFO] [resource_manager_client.go:290] ["[resource manager] exit resource token dispatcher"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:124] ["closing tso client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:182] ["exit tso requests cancel loop"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_dispatcher.go:236] ["exit tso dispatcher loop"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:129] ["close tso client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.217 +00:00] [INFO] [tso_client.go:139] ["tso client is closed"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [pd_service_discovery.go:637] ["[pd] close pd service discovery client"]
[2024/12/25 18:22:29.217 +00:00] [INFO] [client.go:347] ["[pd] http client closed"] [source=tikv-driver]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:182] ["exit tso requests cancel loop"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:236] ["exit tso dispatcher loop"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [pd_service_discovery.go:550] ["[pd] exit member loop due to context canceled"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:423] ["[tso] stop fetching the pending tso requests due to context canceled"] [dc-location=global]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_dispatcher.go:348] ["[tso] exit tso dispatcher"] [dc-location=global]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.218 +00:00] [INFO] [resource_manager_client.go:290] ["[resource manager] exit resource token dispatcher"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:124] ["closing tso client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:129] ["close tso client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_batch_controller.go:158] ["[pd] clear the tso batch controller"] [max-batch-size=10000] [best-batch-size=1] [collected-request-count=0] [pending-request-count=0]
[2024/12/25 18:22:29.218 +00:00] [INFO] [tso_client.go:139] ["tso client is closed"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [pd_service_discovery.go:637] ["[pd] close pd service discovery client"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [client.go:347] ["[pd] http client closed"] [source="br/lightning PD controller"]
[2024/12/25 18:22:29.218 +00:00] [INFO] [metafile.go:725] ["exit write metas by context done"]
[2024/12/25 18:22:29.413 +00:00] [INFO] [collector.go:224] ["units canceled"] [cancel-unit=0]
2024-12-25 18:23:02.678883393 +0000 UTC m=+27134.120592011 write error: write length 982132449 exceeds maximum file size 314572800
[2024/12/25 18:23:11.121 +00:00] [ERROR] [backup.go:57] ["failed to backup"] [error="at fine-grained backup, remained ranges = 0: backoff exceeds the max backoff time 1h0m0s: [BR:Common:ErrUnknown]internal error"] [errorVerbose="[BR:Common:ErrUnknown]internal error\nbackoff exceeds the max backoff time 1h0m0s\ngithub.com/pingcap/tidb/br/pkg/utils.(*RetryWithBackoffer).BackOff\n\t/workspace/source/tidb/br/pkg/utils/retry.go:348\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup\n\t/workspace/source/tidb/br/pkg/backup/client.go:1210\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/workspace/source/tidb/br/pkg/backup/client.go:958\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/workspace/source/tidb/br/pkg/backup/client.go:864\ngithub.com/pingcap/tidb/pkg/util.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/workspace/source/tidb/pkg/util/worker_pool.go:81\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197\nat fine-grained backup, remained ranges = 0"] [stack="main.runBackupCommand\n\t/workspace/source/tidb/br/cmd/br/backup.go:57\nmain.newFullBackupCommand.func1\n\t/workspace/source/tidb/br/cmd/br/backup.go:148\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/[email protected]/command.go:1039\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]
[2024/12/25 18:23:11.122 +00:00] [ERROR] [main.go:38] ["br failed"] [error="at fine-grained backup, remained ranges = 0: backoff exceeds the max backoff time 1h0m0s: [BR:Common:ErrUnknown]internal error"] [errorVerbose="[BR:Common:ErrUnknown]internal error\nbackoff exceeds the max backoff time 1h0m0s\ngithub.com/pingcap/tidb/br/pkg/utils.(*RetryWithBackoffer).BackOff\n\t/workspace/source/tidb/br/pkg/utils/retry.go:348\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).fineGrainedBackup\n\t/workspace/source/tidb/br/pkg/backup/client.go:1210\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/workspace/source/tidb/br/pkg/backup/client.go:958\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/workspace/source/tidb/br/pkg/backup/client.go:864\ngithub.com/pingcap/tidb/pkg/util.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/workspace/source/tidb/pkg/util/worker_pool.go:81\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/root/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1197\nat fine-grained backup, remained ranges = 0"] [stack="main.main\n\t/workspace/source/tidb/br/cmd/br/main.go:38\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]
```

@BornChanger
Contributor

BornChanger commented Dec 26, 2024

@kabileshKabi It seems the backup was almost done. Please try rerunning the same backup command to restart the job; it will take advantage of the checkpoint. Let's see if that works.
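For reference, a sketch of the rerun, using the same command posted earlier in this thread. Assuming checkpoint backup behaves as in recent BR releases, the key point is to reuse the identical `--storage` target so BR can pick up its checkpoint data rather than starting from scratch.

```
# Rerun with the same arguments and, most importantly, the same --storage URI;
# BR should detect the checkpoint data left by the failed run and resume from it.
/home/ec2-user/.tiup/components/br/v8.1.1/br backup full \
  --pd 10.1.3.190:2379 \
  --storage s3://us-chat-db-tidb-backup/test_backupdec252024 \
  --log-file backupdec252024.log
```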

@3pointer
Contributor

From the provided error information:

During the backup phase, specific ranges failed to back up, and the process could not complete even after more than an hour of retries, resulting in a failure.

This type of error is relatively rare. Could you provide more complete backup logs to help us analyze the issue?

We believe the logs from the last hour will contain a large number of `start fine grained backup` entries, indicating that BR was repeatedly attempting to back up but was unable to complete the process.

We need to review the full request logs to identify which specific range caused the issue.

@kabileshKabi
Author

kabileshKabi commented Dec 26, 2024

I have re-initiated the same command, @BornChanger, and I can see it progressing. We have set the GC lifetime to 48 hours.
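For reference, the GC lifetime change mentioned above uses the standard system variable. A minimal sketch, where 48h matches the value described here and 10m0s is the default to restore afterwards:

```
-- Keep the backup snapshot (BackupTS) readable for the whole backup run.
SET GLOBAL tidb_gc_life_time = '48h';
-- After the backup completes, restore the default value.
SET GLOBAL tidb_gc_life_time = '10m0s';
```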

@3pointer I am attaching the collected logs: debug_br.log

@BornChanger
Contributor

@kabileshKabi In the new log there are screen snapshots of the shell command. Can you please send us the log of the failed backup instead?

@kabileshKabi
Author

@BornChanger The log above is indeed from the failed backup.

@BornChanger
Contributor

Please send us the complete backup log if possible.

@kabileshKabi
Author

@BornChanger One more update: the backup that we restarted has now completed.
