Batch sstables of similar size #3979

Open
Tracked by #3706
Michal-Leszczynski opened this issue Aug 19, 2024 · 7 comments · May be fixed by #4055
Labels: enhancement (New feature or request), restore

Comments

@Michal-Leszczynski
Collaborator

Assume the following scenario: a node with 8 shards and a batch of 8 sstables, where 1 is really big and 7 are small.
Since only a single shard works on load&stream of a given sstable, the 7 shards handling the small sstables would finish their work quickly and then wait for the single shard working on the big sstable.
Creating batches of similarly sized sstables would improve load&stream performance.
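For illustration only, here is a minimal Go sketch of that idea (the `sstable` type and `batchSimilarSize` helper are hypothetical names, not SM's actual code): sorting sstables by size and grouping neighbours means each batch contains similarly sized sstables, so a node's shards finish load&stream at roughly the same time.

```go
package main

import (
	"fmt"
	"sort"
)

// sstable is a hypothetical stand-in for SM's internal representation.
type sstable struct {
	ID   string
	Size int64
}

// batchSimilarSize sorts sstables by size (descending) and chunks neighbours
// into batches, so each batch holds sstables of comparable size.
func batchSimilarSize(ssts []sstable, batchSize int) [][]sstable {
	sort.Slice(ssts, func(i, j int) bool { return ssts[i].Size > ssts[j].Size })
	var batches [][]sstable
	for len(ssts) > 0 {
		n := batchSize
		if n > len(ssts) {
			n = len(ssts)
		}
		batches = append(batches, ssts[:n])
		ssts = ssts[n:]
	}
	return batches
}

func main() {
	ssts := []sstable{
		{"s1", 1000}, {"s2", 12}, {"s3", 900}, {"s4", 10},
		{"s5", 950}, {"s6", 11}, {"s7", 980}, {"s8", 9},
	}
	// With batches of 4, the big sstables end up together and the small ones
	// end up together, instead of 7 shards idling behind 1 big sstable.
	for i, b := range batchSimilarSize(ssts, 4) {
		fmt.Println("batch", i, b)
	}
}
```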

@mykaul
Contributor

mykaul commented Sep 3, 2024

Is this part of 3.3.2?

@Michal-Leszczynski
Collaborator Author

No, 3.3.2 will mostly contain a fix for #3989 and perhaps #4007 and #3995.
I would say that the restore-related improvements are big enough for 3.4, but if there is a need to keep this as a patch release, it could be 3.3.3.

karol-kokoszka reopened this Sep 9, 2024
@karol-kokoszka
Collaborator

The change is going to affect the meaning of the --batch-size flag when the value is set to 0.
0 is going to be a wildcard meaning that the batch size is expected to be 5% of all the data that would be sent to the node (based on the following calculation: (amount_of_restore_data * number_of_node_shards) / number_of_all_shards).
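A rough worked example of that calculation (the `expectedNodeWorkload` helper and all numbers are illustrative assumptions, not SM's code), assuming 1 TiB of data to restore, a destination node with 16 shards, and 96 shards in the whole destination cluster:

```go
package main

import "fmt"

// expectedNodeWorkload mirrors the calculation above: the node's share of the
// restored data is proportional to its share of the cluster's shards.
func expectedNodeWorkload(restoreDataSize int64, nodeShards, allShards int) int64 {
	return restoreDataSize * int64(nodeShards) / int64(allShards)
}

func main() {
	const gib = int64(1) << 30
	restoreDataSize := 1024 * gib // 1 TiB to restore in total

	workload := expectedNodeWorkload(restoreDataSize, 16, 96)
	batchTarget := workload * 5 / 100 // --batch-size=0 -> ~5% of node workload

	fmt.Printf("expected node workload: ~%d GiB, batch target: ~%d GiB\n",
		workload/gib, batchTarget/gib)
}
```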

karol-kokoszka changed the title from "Batch sstables of similar size" to "Batch sstables basing on percentage of the node's data" Sep 9, 2024
Michal-Leszczynski added a commit that referenced this issue Sep 25, 2024
This is a temporary implementation used for integrating workload
indexing with the rest of the code. It will be improved as part
of #3979.
Michal-Leszczynski added a commit that referenced this issue Oct 2, 2024
This commit allows setting --batch-size=0.
When this happens, batches will be created so that they contain
about 5% of the expected node workload during restore.
This allows for creating big, yet evenly distributed batches
without the need to play with the --batch-size flag.
It should also work better when the backed-up cluster
had a different number of nodes than the restore destination
cluster.

Fixes #3979
Michal-Leszczynski linked a pull request Oct 2, 2024 that will close this issue
Michal-Leszczynski added a commit that referenced this issue Oct 3, 2024
* feat(schema): drop sequential restore run tracking

Right now SM restores location by location, manifest by manifest, table by table.
That's why it tracks restore progress by keeping the location/manifest/table in the DB.
We are moving away from the sequential restore approach in favor of
restoring from all locations/manifests/tables at the same time.

* feat(restore): adjust model to dropped sequential restore run tracking

* refactor(backupspec): include the newest file version in ListVersionedFiles

There is no need to iterate versioned files (ListVersionedFiles) and non-versioned files (buildFilesSizesCache) separately.
Doing it in a single iteration is faster, and it allows storing all size information in a single place.

* feat(restore): add workload indexing

This commit introduces the structure of the restore workload.
The workload is divided per location->table->remote sstable directory.
This changes the hierarchy established by manifests (location->node->table->remote sstable dir).
It also aggregates files into actual sstables, extracts their IDs, aggregates their sizes,
and keeps track of sstable versioning (a rough sketch of this hierarchy follows this commit message).

* feat(restore): index, support resume

The indexed workload won't contain sstables that were already
restored during a previous restore run.

* feat(restore): index, support metrics init

* feat(restore): add primitive batching using indexed workload

This is a temporary implementation used for integrating workload
indexing with the rest of the code. It will be improved as part
of #3979.

* feat(restore): integrate new indexing and batching with codebase

This commit integrates the new indexing and batching
approaches into the restore tables codebase.

* fix(restore): handle download of fully versioned batch

Recent commits changed the versioned batch download so that
if any sstable component is versioned, then all sstable components
are downloaded as versioned files.
It was done this way to allow easier versioned progress calculation
(we don't store per-file sizes, only the whole sstable size).

This brought to light a bug (which existed before, but was more difficult to hit),
in which restoring a batch failed when the whole batch was versioned,
as calling RcloneSyncCopyPaths with an empty paths parameter resulted in a broken download.

We could just skip the RcloneSyncCopyPaths call when the whole batch is versioned,
but this would leave us without the agentJobID, which is a part of the sort key
in RestoreRunProgress. Without it, we could potentially overwrite one
restore run progress with another - if both of them happened on the same
RemoteSSTableDir, by the same Host, and were fully versioned.
It would also introduce different paths for restoring a regular batch and a fully versioned batch,
which is not desirable.

That's why I decided to modify the rclone server to allow an empty paths parameter,
so that it still generates the agentJobID, but doesn't do anything beyond that.

* feat(restore): index, log workload info

The workload info contains the per-location/table/remote-sstable-dir sstable count,
total size, and max and average sstable size.
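For reference, a rough sketch of the workload hierarchy described in the "add workload indexing" item above. All type, field, and package names here are illustrative assumptions, not SM's actual definitions:

```go
// Package name is illustrative; these types only sketch the
// location -> table -> remote sstable dir hierarchy.
package restore

type Workload struct {
	Locations []LocationWorkload
}

type LocationWorkload struct {
	Location string // backup location
	Size     int64  // aggregated size of all sstables in this location
	Tables   []TableWorkload
}

type TableWorkload struct {
	Keyspace, Table string
	Size            int64
	RemoteDirs      []RemoteDirWorkload
}

type RemoteDirWorkload struct {
	RemoteSSTableDir string
	Size             int64
	SSTables         []SSTable
}

// SSTable groups all component files of a single sstable under its ID.
type SSTable struct {
	ID        string
	Files     []string
	Size      int64
	Versioned bool // whether components need to be fetched as versioned files
}
```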
@Michal-Leszczynski
Collaborator Author

@karol-kokoszka do we want to make --batch-size=0 the default in SM 3.4?
If I were to choose the default values from scratch, I would choose them so that restore is as fast as possible, as this is the most common case. On the other hand, changing the flag's default value might break compatibility in other projects.

@karol-kokoszka
Collaborator

@Michal-Leszczynski please leave the current default.

@Michal-Leszczynski
Collaborator Author

@karol-kokoszka so we can close #3809 then?

Michal-Leszczynski changed the title from "Batch sstables basing on percentage of the node's data" back to "Batch sstables of similar size" Oct 3, 2024
Michal-Leszczynski added a commit that referenced this issue Oct 3, 2024
This results in creating batches of sstables of
more similar size.

Fixes #3979
@karol-kokoszka
Collaborator

Let's bring it to planning. IMHO we should keep the default, but let's discuss it during planning.
