
Restore improvements #3986

Draft · wants to merge 15 commits into master
Conversation

@Michal-Leszczynski (Collaborator) commented on Aug 20, 2024

Based on #3956.

This PR introduces 3 big changes:

  1. Indexing
    The restore task no longer iterates over manifests, restoring all the files from a given manifest before proceeding to the next one.
    The problem with this approach was that it limited the pool of sstables available for creating batches, and it resulted in nodes spending more time idle between restored manifests.
    Now, SM indexes all files from all manifests at the beginning and then creates batches from all of them, without the need to finish one manifest before starting another.

  2. Batching
    The restore task no longer batches files based on the --batch-size flag.
    The problem with this approach was that it didn't take shard_cnt or sstable size into consideration. It was also difficult to guess the right value for this flag without knowing the backup statistics.
    Now, SM creates batches containing a multiple of the node's shard_cnt sstables. It keeps adding the biggest (not yet restored) sstables to the batch until the batch reaches 5% of the expected node workload (total_workload / total_shard_cnt_in_the_cluster * node_shard_cnt); a sketch of this logic follows the list below.

  3. Additional testing flags
    These flags exist just for testing purposes. Some of them will be removed and some might stay, depending on how useful we find them:

  • --table-parallel - how many tables should be restored from a single node at the same time (might be useful for many smaller tables)
  • --stream-to-all-replicas - runs load&stream without the primary_replica_only option and skips the post-restore repair (might be useful in a cluster with a small RF and a large number of replica sets)
  • --unpin-agent-cpu - unpins the agent from its CPUs for the duration of the restore (it looks like it increases download speed to some extent)
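
A minimal sketch of the batching logic from point 2 above, in Go. All names here (`sstable`, `buildBatch`, the parameters) are hypothetical illustrations of the described behavior, not the actual scylla-manager code:

```go
package batching

import "sort"

// sstable is a hypothetical representation of an indexed,
// not-yet-restored sstable.
type sstable struct {
	ID   string
	Size int64 // total size of the files belonging to this sstable
}

// buildBatch greedily picks the biggest remaining sstables for one node.
// The batch holds a multiple of the node's shard count of sstables and
// grows until it reaches ~5% of the expected node workload:
// totalWorkload / totalShardCnt * nodeShardCnt.
func buildBatch(remaining []sstable, nodeShardCnt int, totalWorkload, totalShardCnt int64) (batch, rest []sstable) {
	// Biggest first, so a batch tends to contain similarly sized sstables.
	sort.Slice(remaining, func(i, j int) bool {
		return remaining[i].Size > remaining[j].Size
	})

	target := totalWorkload / totalShardCnt * int64(nodeShardCnt) / 20 // 5%

	var size int64
	i := 0
	for i < len(remaining) && size < target {
		// Take sstables in multiples of the shard count so that
		// every shard on the node gets work.
		for s := 0; s < nodeShardCnt && i < len(remaining); s++ {
			size += remaining[i].Size
			i++
		}
	}
	return remaining[:i], remaining[i:]
}
```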

This decreases main function complexity.
Noticed by: cognitive complexity 66 of func `(*tablesWorker).restore` is high (> 50) (gocognit).
…abled and enabled

In preparation for restoring data, SM should disable tombstone_gc and compaction.
They should be re-enabled after the restore finishes.
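
A minimal sketch of what the disable/re-enable step could look like, assuming the gocql driver for the CQL part; the helper name is hypothetical, and the exact mechanism SM uses to toggle autocompaction (agent/REST call) is intentionally omitted:

```go
package restore

import (
	"fmt"

	"github.com/gocql/gocql"
)

// setTombstoneGC toggles tombstone_gc on a table via CQL: "disabled"
// before the restore, "timeout" (the Scylla default) after it finishes.
func setTombstoneGC(session *gocql.Session, keyspace, table, mode string) error {
	stmt := fmt.Sprintf(
		"ALTER TABLE %q.%q WITH tombstone_gc = {'mode': '%s'}",
		keyspace, table, mode)
	return session.Query(stmt).Exec()
}

// Autocompaction is toggled per node, e.g. via Scylla's REST API or
// nodetool enableautocompaction/disableautocompaction; not shown here.
```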
The idea is to first index all files to be restored, so that we can create better batches. The restore workload is aggregated first by table, then by remote sstable dir.
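
A minimal sketch of that aggregation as Go data structures; the type names are illustrative, not the actual scylla-manager index:

```go
package indexing

// SSTable is one indexed, not-yet-restored sstable.
type SSTable struct {
	ID   string
	Size int64
}

// RemoteSSTableDir groups the sstables found under a single remote
// directory (one backed-up node's data for the table).
type RemoteSSTableDir struct {
	Path     string
	SSTables []SSTable
	Size     int64 // sum of sstable sizes in this dir
}

// TableWorkload aggregates the restore workload of a single table
// across all manifests.
type TableWorkload struct {
	Keyspace, Table string
	Dirs            []RemoteSSTableDir
	Size            int64 // total workload of this table
}

// Workload is the full indexed restore workload, built once up front,
// so batches can be created from any table/manifest.
type Workload struct {
	Tables []TableWorkload
	Size   int64 // total restore workload
}
```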

From batching we expect:
- batch contains X*shard_cnt sstables
- batch contains similarly sized sstables
- batch is created from any manifest/table - no waiting for manifest/table restore to finish
- workload across different nodes is evenly distributed
- --batch-size is ignored; batches aim for a size equal to 5% of the expected node workload

New flag --table-parallel - it allows running multiple download and load&stream jobs from the same node in parallel (no documentation, as it might not be exposed later on).
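
A minimal sketch of this kind of bounded per-node parallelism, using golang.org/x/sync/errgroup; `restoreTable` is a hypothetical stand-in for the per-table download + load&stream job:

```go
package restore

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// restoreTables runs up to tableParallel table restores from one node
// at the same time, stopping on the first error.
func restoreTables(ctx context.Context, tables []string, tableParallel int,
	restoreTable func(ctx context.Context, table string) error) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(tableParallel) // at most tableParallel jobs in flight
	for _, t := range tables {
		t := t // capture loop variable (pre-Go 1.22)
		g.Go(func() error { return restoreTable(ctx, t) })
	}
	return g.Wait()
}
```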