
Store bandwidth characteristic of Manager restore process #4042

Closed
mikliapko opened this issue Sep 25, 2024 · 5 comments · Fixed by #4088
Labels: enhancement, restore

Comments

@mikliapko

Since evaluating the restore process performance by bandwidth seems like the most reasonable approach - especially when comparing across different clusters and datasets - it makes sense to store and display this data in Manager.

It would be a great improvement for restore benchmarking tests.

@karol-kokoszka
Collaborator

karol-kokoszka commented Sep 30, 2024

The suggestion is to include the bandwidth of the restore in the sctool progress output.
The current output is (example from another issue):
[screenshot of the current sctool progress output]

To achieve the goal of this task, we need to store information about the duration of load&stream, download, and idle. Right now, this is only indicated by the scylla_manager_restore_state metric.

The restore progress that is saved to the DB has the following form:

type RunProgress struct {
	ClusterID uuid.UUID
	TaskID    uuid.UUID
	RunID     uuid.UUID

	ManifestPath string
	Keyspace     string `db:"keyspace_name"`
	Table        string `db:"table_name"`
	Host         string // IP of the node to which SSTables are downloaded.
	AgentJobID   int64

	SSTableID           []string `db:"sstable_id"`
	DownloadStartedAt   *time.Time
	DownloadCompletedAt *time.Time
	RestoreStartedAt    *time.Time
	RestoreCompletedAt  *time.Time
	Error               string
	Downloaded          int64
	Skipped             int64
	Failed              int64
	VersionedProgress   int64
}

It looks like we already store the needed data in the DB. It's per batch of SSTables (the SSTableID column): we store when the download started/ended (DownloadStartedAt/DownloadCompletedAt), when the load&stream started/ended (RestoreStartedAt/RestoreCompletedAt), and which host owned the batch (Host). Besides that, we save how many bytes were downloaded/restored in the Downloaded column.
This gives us the possibility of querying the "restore_run_progress" table when the restore completes and building, for every host, a map of how much time it took to download and to load&stream. The idle time is: duration - (download + load&stream).
Then we have the formulas:

download bandwidth = downloaded (per host) / duration of the download (DownloadCompletedAt - DownloadStartedAt)
load&stream bandwidth = downloaded (per host) / duration of the load&stream (RestoreCompletedAt - RestoreStartedAt)

We can show the idle time in the summary as well by outputting (per host):

Restore duration - (time reported as download) - (time reported as load & stream)

The output can be:

<host>:
   download: BW
   load&stream: BW
   idle: time

As the output may be quite big for an x-node cluster, we can include it in the restore progress output only when the --details flag is used.
https://manager.docs.scylladb.com/stable/sctool/progress.html#details
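
For illustration, here is a minimal Go sketch of the per-host aggregation described above (not the merged implementation). It assumes the RunProgress struct quoted earlier is in scope and that "fmt" and "time" are imported; the helper names are made up for this example.

// hostStats accumulates per-host totals from restore_run_progress rows.
type hostStats struct {
	Downloaded       int64         // bytes downloaded to the host
	DownloadDuration time.Duration // sum of DownloadCompletedAt - DownloadStartedAt
	StreamDuration   time.Duration // sum of RestoreCompletedAt - RestoreStartedAt
}

// aggregateBandwidth folds the progress rows of a finished run into per-host totals.
func aggregateBandwidth(rows []RunProgress) map[string]hostStats {
	out := make(map[string]hostStats)
	for _, r := range rows {
		s := out[r.Host]
		s.Downloaded += r.Downloaded
		if r.DownloadStartedAt != nil && r.DownloadCompletedAt != nil {
			s.DownloadDuration += r.DownloadCompletedAt.Sub(*r.DownloadStartedAt)
		}
		if r.RestoreStartedAt != nil && r.RestoreCompletedAt != nil {
			s.StreamDuration += r.RestoreCompletedAt.Sub(*r.RestoreStartedAt)
		}
		out[r.Host] = s
	}
	return out
}

// bandwidth returns bytes per second, guarding against a zero duration.
func bandwidth(bytes int64, d time.Duration) float64 {
	if d <= 0 {
		return 0
	}
	return float64(bytes) / d.Seconds()
}

// printSummary renders the per-host output proposed above; idle is the total
// restore duration minus the time spent on download and load&stream.
func printSummary(stats map[string]hostStats, restoreDuration time.Duration) {
	for host, s := range stats {
		idle := restoreDuration - s.DownloadDuration - s.StreamDuration
		fmt.Printf("%s:\n   download: %.0f B/s\n   load&stream: %.0f B/s\n   idle: %s\n",
			host, bandwidth(s.Downloaded, s.DownloadDuration),
			bandwidth(s.Downloaded, s.StreamDuration), idle)
	}
}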

@karol-kokoszka
Collaborator

It's possibly worth including in the 3.4 release.

@dorlaor

dorlaor commented Oct 14, 2024

Nice!

@mykaul
Contributor

mykaul commented Oct 14, 2024

Isn't it why God created Metrics and we implemented Monitoring?

@Michal-Leszczynski
Collaborator

I will add it to both the metrics and the progress display.
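
As a rough illustration of what such bandwidth metrics could look like with the standard Prometheus Go client (a sketch only; the metric names, labels, and helper below are hypothetical and may differ from what gets merged):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical restore bandwidth metrics; bandwidth can then be derived in
// Monitoring as downloaded_bytes / download_duration_seconds per host.
var (
	restoreDownloadedBytes = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "scylla_manager",
		Subsystem: "restore",
		Name:      "downloaded_bytes",
		Help:      "Bytes downloaded from the backup location, per host.",
	}, []string{"cluster", "host"})

	restoreDownloadDuration = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Namespace: "scylla_manager",
		Subsystem: "restore",
		Name:      "download_duration_seconds",
		Help:      "Time spent downloading SSTables, per host.",
	}, []string{"cluster", "host"})
)

func init() {
	prometheus.MustRegister(restoreDownloadedBytes, restoreDownloadDuration)
}

// recordDownload is called after a batch finishes downloading.
func recordDownload(cluster, host string, bytes int64, seconds float64) {
	restoreDownloadedBytes.WithLabelValues(cluster, host).Add(float64(bytes))
	restoreDownloadDuration.WithLabelValues(cluster, host).Add(seconds)
}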

Michal-Leszczynski added a commit that referenced this issue Oct 26, 2024
It's useful for checking/tracking restore performance.

Ref #4042
Michal-Leszczynski added a commit that referenced this issue Oct 28, 2024
It's useful for checking/tracking restore performance.

Ref #4042
Michal-Leszczynski added a commit that referenced this issue Oct 30, 2024
* refactor(restore): separate methods for updating metrics/progress

This should make it easier to see what is updated where and when.

* feat(metrics): restore, add bandwidth metrics

They are really useful for evaluating restore performance.

* feat(restore): set download/stream bytes/duration metrics

It's useful for checking/tracking restore performance.

Ref #4042

* fix(restore): don't initialize metrics twice

This was a left-over from the PR introducing
indexing (14aef7b). That PR initialized metrics
as part of the indexing procedure but forgot
to remove the previous metrics initialization
from the code.

* fix(restore): use backup cluster ID in remaining_bytes metric

There was confusion about which cluster ID should
be used for labeling the remaining_bytes metric.
When setting remaining_bytes, we used the backup cluster ID,
but when decreasing it, we used the restore cluster ID.
The backup cluster ID should be used in both places,
as this metric describes how many bytes from
which place are yet to be restored. Since we use the backup
cluster DC, node ID, etc., we should also use the backup
cluster ID.
karol-kokoszka pushed a commit that referenced this issue Nov 4, 2024
Restore: add and fill host info in restore progress

* chore(go.mod): remove replace directive to SM submodules

It was a left-over from feature development :/

* chore(go.mod): bump SM submodules deps

* feat(schema): add shard cnt to restore_run_progress

It's going to be needed for calculating per-shard
download/stream bandwidth in the progress command.

* feat(restore): add and fill shard cnt in restore run progress

This commit also moves the host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.

* feat(restore): add and fill host info in progress

This allows calculating per-shard download/stream
bandwidth in the 'sctool progress' display.

* feat(managerclient): display bandwidth in sctool progress

Fixes #4042

* feat(managerclient): include B or iB in SizeSuffix display

It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".