Backup(BR) is getting failed for huge table count #58513
Comments
@kabileshKabi thanks for reporting the issue! The backup log seems to be truncated. Was the backup task killed? Was the BR pod OOM-killed?
Also, your cluster has 600k tables, so the backup initialization phase needs some time to load the schema. The errors in the attached log are transient rather than fatal. @kabileshKabi
BR is running on a VM node. The backup is not being killed; it just exits.
This seems to be the issue: it needs to load the schema and it is getting an RPC timeout.
I am sharing the log again. Command execution log:
@kabileshKabi if the backup log is complete as it is, I think it's likely that the BR node (the tiup host) ran out of memory. Can you double check? If that's the reason, please use a higher-end VM (say 64 GiB of memory) to run the backup.
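One way to double check for an OOM kill on the host is to scan the kernel log for OOM-killer messages. The following is a minimal sketch, not part of BR or tiup: the helper name `find_oom_kills` is hypothetical, and it assumes a Linux host whose kernel log uses the usual `Out of memory: Killed process <pid> (<name>)` wording.

```python
import re

# Matches the Linux OOM-killer line, e.g.
# "Out of memory: Killed process 4242 (br) total-vm:..."
# Older kernels print "Kill process" instead of "Killed process".
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log):
    """Return a list of (pid, process_name) for every OOM kill in the text.

    `kernel_log` is plain text, for example the captured output of
    `dmesg` or `journalctl -k` (hypothetical helper, for illustration).
    """
    return [(int(m.group(1)), m.group(2)) for m in OOM_LINE.finditer(kernel_log)]
```

To use it, save the kernel log from the node that runs the backup to a file, read the file into a string, and pass that string to `find_oom_kills`. An entry whose process name is `br` would suggest the backup process was killed for memory rather than exiting on its own.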
@BornChanger is there any way to limit the memory usage of BR? I can see it being killed by the OOM killer on the tiup node:
@BornChanger Logs:
@kabileshKabi it seems the backup is almost there. Please try rerunning the same backup command first; the restarted job will take advantage of the checkpoint. Then see if it works.
From the provided error information: during the backup phase, specific ranges failed to back up, and the process could not complete even after more than 1 hour, resulting in a failure. This type of error is relatively rare. Could you provide more complete backup logs to help us analyze the issue? We believe the logs from the last hour will contain a large number of We need to review the full request logs to identify which specific range caused the issue.
I have re-initiated the same command @BornChanger, and I see it is progressing. We have kept the GC life time at 48 hours. @3pointer I am attaching the collected logs: debug_br.log
@kabileshKabi the new log contains screen snapshots of the shell command. Can you please send us the log of the failed backup instead?
@BornChanger the above log is for the failed backup only.
Please send us the complete backup log if possible.
@BornChanger one more update: the backup that we restarted has now completed.
Bug Report
Please answer these questions before submitting your issue. Thanks!
1. Minimal reproduce step (Required)
Issue Description:
Our application is a SaaS-based multi-tenant application in which each tenant has its own DB. We have more than 14k databases and more than 600k tables.
We have a strict backup requirement, but when we run the BR full backup it shows no progress and fails with an RPC error.
Command and Log: