
Store bandwidth characteristic of Manager restore process #4042

Closed
mikliapko opened this issue Sep 25, 2024 · 5 comments · Fixed by #4088
Labels: enhancement, restore

Comments

@mikliapko

Since evaluating the restore process performance by bandwidth seems like the most reasonable approach - especially when comparing across different clusters and datasets - it makes sense to store and display this data in Manager.

It would be a great improvement for restore benchmarking tests.

@karol-kokoszka
Collaborator

karol-kokoszka commented Sep 30, 2024

The suggestion is to include the bandwidth of the restore in the sctool progress output.
The current output is (example from another issue):
[screenshot of the current sctool progress output]

To achieve the goal of this task, we need to store information about the duration of load&stream, download, and idle. Right now, this is only indicated by the scylla_manager_restore_state metric.

The restore progress that is saved to the DB has the following form:

type RunProgress struct {
	ClusterID uuid.UUID
	TaskID    uuid.UUID
	RunID     uuid.UUID

	ManifestPath string
	Keyspace     string `db:"keyspace_name"`
	Table        string `db:"table_name"`
	Host         string // IP of the node to which SSTables are downloaded.
	AgentJobID   int64

	SSTableID           []string `db:"sstable_id"`
	DownloadStartedAt   *time.Time
	DownloadCompletedAt *time.Time
	RestoreStartedAt    *time.Time
	RestoreCompletedAt  *time.Time
	Error               string
	Downloaded          int64
	Skipped             int64
	Failed              int64
	VersionedProgress   int64
}

It looks like we already store the needed data in the DB. It's per batch of SSTables (the SSTableID column): we store when the download started/ended (DownloadStartedAt/DownloadCompletedAt), when the load&stream started/ended (RestoreStartedAt/RestoreCompletedAt), and which host owned the batch (Host). Besides that, we save how many bytes were downloaded/restored in the Downloaded column.
This gives us the possibility of querying the "restore_run_progress" table when the restore completes and building, for every host, a map of how much time it took to download and to load&stream. The idle time is: duration - (download + load&stream).
Then we have the formulas:

download bandwidth = downloaded (per host) / duration of the download (DownloadCompletedAt - DownloadStartedAt)
load&stream bandwidth = downloaded (per host) / duration of the load&stream (RestoreCompletedAt - RestoreStartedAt)

We can show the idle time in the summary as well by outputting (per host):

Restore duration - (time reported as download) - (time reported as load & stream)

The output can be:

<host>:
   download: BW
   load&stream: BW
   idle: time

As the output may be quite big for an x-node cluster, we can include it in the restore progress output only when the --details flag is used.
https://manager.docs.scylladb.com/stable/sctool/progress.html#details
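
For illustration, here is a minimal Go sketch of the per-host aggregation described above (not the merged implementation). It assumes the RunProgress struct quoted earlier is in scope and that "fmt" and "time" are imported; the helper names are made up for this example.

// hostStats accumulates per-host totals from restore_run_progress rows.
type hostStats struct {
	Downloaded       int64         // bytes downloaded to the host
	DownloadDuration time.Duration // sum of DownloadCompletedAt - DownloadStartedAt
	StreamDuration   time.Duration // sum of RestoreCompletedAt - RestoreStartedAt
}

// aggregateBandwidth folds the progress rows of a finished run into per-host totals.
func aggregateBandwidth(rows []RunProgress) map[string]hostStats {
	out := make(map[string]hostStats)
	for _, r := range rows {
		s := out[r.Host]
		s.Downloaded += r.Downloaded
		if r.DownloadStartedAt != nil && r.DownloadCompletedAt != nil {
			s.DownloadDuration += r.DownloadCompletedAt.Sub(*r.DownloadStartedAt)
		}
		if r.RestoreStartedAt != nil && r.RestoreCompletedAt != nil {
			s.StreamDuration += r.RestoreCompletedAt.Sub(*r.RestoreStartedAt)
		}
		out[r.Host] = s
	}
	return out
}

// bandwidth returns bytes per second, guarding against a zero duration.
func bandwidth(bytes int64, d time.Duration) float64 {
	if d <= 0 {
		return 0
	}
	return float64(bytes) / d.Seconds()
}

// printSummary renders the per-host output proposed above; idle is the total
// restore duration minus the time spent on download and load&stream.
func printSummary(stats map[string]hostStats, restoreDuration time.Duration) {
	for host, s := range stats {
		idle := restoreDuration - s.DownloadDuration - s.StreamDuration
		fmt.Printf("%s:\n   download: %.0f B/s\n   load&stream: %.0f B/s\n   idle: %s\n",
			host, bandwidth(s.Downloaded, s.DownloadDuration),
			bandwidth(s.Downloaded, s.StreamDuration), idle)
	}
}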

@karol-kokoszka
Collaborator

It's possibly worth including in the 3.4 release.

@dorlaor

dorlaor commented Oct 14, 2024

Nice!

@mykaul
Contributor

mykaul commented Oct 14, 2024

Isn't it why God created Metrics and we implemented Monitoring?

@Michal-Leszczynski
Collaborator

I will add it to both the metrics and the progress display.
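
As a rough illustration of what such bandwidth metrics could look like with the standard Prometheus Go client (a sketch only; the metric names, labels, and helper below are hypothetical and may differ from what gets merged):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical restore bandwidth metrics; bandwidth can then be derived in
// Monitoring as downloaded_bytes / download_duration_seconds per host.
var (
	restoreDownloadedBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "scylla_manager",
		Subsystem: "restore",
		Name:      "downloaded_bytes",
		Help:      "Bytes downloaded from the backup location, per host.",
	}, []string{"cluster", "host"})

	restoreDownloadDuration = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "scylla_manager",
		Subsystem: "restore",
		Name:      "download_duration_seconds",
		Help:      "Time spent downloading SSTables, per host.",
	}, []string{"cluster", "host"})
)

func init() {
	prometheus.MustRegister(restoreDownloadedBytes, restoreDownloadDuration)
}

// recordDownload is called after a batch finishes downloading.
func recordDownload(cluster, host string, bytes int64, seconds float64) {
	restoreDownloadedBytes.WithLabelValues(cluster, host).Add(float64(bytes))
	restoreDownloadDuration.WithLabelValues(cluster, host).Add(seconds)
}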

Michal-Leszczynski added a commit that referenced this issue Oct 26, 2024
It's useful for checking/tracking restore performance.

Ref #4042
Michal-Leszczynski added a commit that referenced this issue Oct 28, 2024
It's useful for checking/tracking restore performance.

Ref #4042
Michal-Leszczynski added a commit that referenced this issue Oct 30, 2024
* refactor(restore): separate methods for updating metrics/progress

This should make it easier to see what is updated where and when.

* feat(metrics): restore, add bandwidth metrics

They are really useful for evaluating restore performance.

* feat(restore): set download/stream bytes/duration metrics

It's useful for checking/tracking restore performance.

Ref #4042

* fix(restore): don't initialize metrics twice

This was a left-over from the PR introducing
indexing (14aef7b). That PR initialized metrics
as part of the indexing procedure but forgot
to remove the previous metrics initialization
from the code.

* fix(restore): use backup cluster ID in remaining_bytes metric

There was confusion about which cluster ID should
be used for labeling the remaining_bytes metric.
When setting remaining_bytes, we used the backup cluster ID,
but when decreasing it, we used the restore cluster ID.
The backup cluster ID should be used in both places,
as this metric describes how many bytes from
which place are yet to be restored. Since we use the backup
cluster DC, node ID, etc., we should also use the backup
cluster ID.
karol-kokoszka pushed a commit that referenced this issue Nov 4, 2024
Restore: add and fill host info in restore progress

* chore(go.mod): remove replace directive to SM submodules

It was a left-over from feature development :/

* chore(go.mod): bump SM submodules deps

* feat(schema): add shard cnt to restore_run_progress

It's going to be needed for calculating per-shard
download/stream bandwidth in the progress command.

* feat(restore): add and fill shard cnt in restore run progress

This commit also moves the host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.

* feat(restore): add and fill host info in progress

This allows calculating per-shard download/stream
bandwidth in the 'sctool progress' display.

* feat(managerclient): display bandwidth in sctool progress

Fixes #4042

* feat(managerclient): include B or iB in SizeSuffix display

It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".