feat(1-1-restore): validates if source and target cluster nodes are equal #4230

Open
wants to merge 8 commits into base: va/fast-restore-part-2
Conversation

VAveryanov8 (Collaborator)

This adds a validation stage for 1-1-restore. The logic is as follows:

  • Collect node information for the source cluster from backup manifests
  • Collect node information for the target cluster from the Scylla API
  • Apply node mappings to the source cluster nodes
  • Compare each source node with its corresponding target node

Fixes: #4201
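For context, a minimal sketch of what the final comparison stage could look like; every type and function name below is illustrative, not the PR's actual code:

```go
package sketch

import (
	"fmt"
	"slices"
)

// sketchNode carries the per-node facts the validation compares
// (DC, rack, host ID, shard count, token ring); the exact fields are an assumption.
type sketchNode struct {
	DC, Rack, HostID string
	ShardCount       int64
	Tokens           []int64 // assumed to be sorted before comparison
}

// compareMappedNodes checks every source node (built from backup manifests and already
// renamed via the nodes-mapping) against its target node (reported by the Scylla API).
func compareMappedNodes(source, target map[string]sketchNode) error {
	for hostID, s := range source {
		t, ok := target[hostID]
		if !ok {
			return fmt.Errorf("target cluster has no node matching source node %s", hostID)
		}
		if s.DC != t.DC || s.Rack != t.Rack || s.ShardCount != t.ShardCount || !slices.Equal(s.Tokens, t.Tokens) {
			return fmt.Errorf("source node %s and target node %s are not 1-1 compatible", hostID, t.HostID)
		}
	}
	return nil
}
```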


Please make sure that:

  • Code is split into commits that each address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 marked this pull request as ready for review January 27, 2025 17:30
@VAveryanov8 VAveryanov8 force-pushed the va/fast-restore-part-2 branch from 37409c0 to 7e9f877 on January 29, 2025 14:35
…equal

This adds a validation stage for 1-1-restore. The logic is as follows:
- Collect node information for the source cluster from backup manifests
- Collect node information for the target cluster from the Scylla API
- Apply node mappings to the source cluster nodes
- Compare each source node with its corresponding target node

Fixes: #4201
This changes the following parts of the validation process:
- Moves path.Join("backup", string(MetaDirKind)) to the `backupspec` pkg (see the sketch below)
- Moves getManifestContext to worker_manifest
- Adds SourceClusterID validation to getManifestInfo
- Simplifies how node info is collected by leveraging node mappings (maps manifests to nodes by host ID)
- Replaces the LocationInfo struct with manifests and hosts
- Sorts node tokens
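For the first item above, a sketch of what the moved helper boils down to; the function name and the DirKind stub are assumptions, only the path.Join("backup", string(MetaDirKind)) expression comes from the commit message:

```go
package backupspecsketch

import "path"

// DirKind and MetaDirKind stand in for the existing backupspec identifiers.
type DirKind string

const MetaDirKind DirKind = "meta"

// MetaBaseDir returns the base backup directory that holds the manifests.
func MetaBaseDir() string {
	return path.Join("backup", string(MetaDirKind))
}
```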
@Michal-Leszczynski Michal-Leszczynski (Collaborator) left a comment:

The validation process looks way cleaner now:)

nodesCountSet = strset.New()
)
for _, location := range target.Location {
// Ignore location.DC because all mappings should be specified via nodes-mapping file
Collaborator left a comment:

That's a good direction, but then we shouldn't allow the user to specify the location DC in the first place.
Or we can still treat it as a hint if it simplifies the implementation, although I'm not sure about that.

Collaborator left a comment:

Guys, why not invert the logic here?
Instead of iterating over locations, iterate over target nodes to check whether they can access the location keeping their expected SSTables? Each node must have that access.

Actually, please ensure that each node has access to the corresponding location.

}
}

nodesWithAccessCount := len(nodesCountSet.List())
Collaborator left a comment:

nit: you can use nodesCountSet.Size().
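For reference, that swap would look roughly like this (a sketch, assuming nodesCountSet is the strset set created above):

```go
// Size() returns the element count directly, avoiding the slice allocated by List().
nodesWithAccessCount := nodesCountSet.Size()
```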

return nil
}

func checkOne2OneRestoreCompatiblity(sourceNodeInfo, targetNodeInfo []nodeValidationInfo, nodeMappings []nodeMapping) error {
Collaborator left a comment:

typo: checkOne2OneRestoreCompatiblity -> checkOne2OneRestoreCompatibility

DC string
Rack string
HostID string
CPUCount int64
Collaborator left a comment:

Shouldn't it be shard count instead of cpu count?

Collaborator left a comment:

Agreed, it should be shards.
The CPU count is the number of CPUs available on the machine.
The shard count defines how many CPUs Scylla actually uses.
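In other words, the suggestion amounts to renaming the field along these lines (a sketch based on the nodeValidationInfo fields quoted above; the exact change is up to the PR):

```go
type nodeValidationInfo struct {
	DC         string
	Rack       string
	HostID     string
	ShardCount int64 // shards Scylla actually runs with, rather than the machine's CPU count
}
```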

@karol-kokoszka karol-kokoszka (Collaborator) left a comment:

A few comments.
One is about changing the logic to iterate over target nodes instead of locations; this is the most important one for me.

@@ -28,10 +28,17 @@ type nodeMapping struct {

 type node struct {
 	DC   string `json:"dc"`
-	Rack string `json:"rack"`
+	Rack string `json:"rack_id"`
logger log.Logger
}

// getManifestsAndHosts checks that each host in target cluster should have an access to at least one target location and fetches manifest info.
Collaborator left a comment:

This comment is a bit confusing.
Each node must have access to the location where its SSTables are stored.

1-1 restore provides a mapping between a source node and a target node.

The location defines the DC. If the DC is empty, it means (or should mean) that all DCs are there.
See https://manager.docs.scylladb.com/stable/backup#meta

The method must check that all nodes expected to restore a particular DC have access to the location that keeps the SSTables of that DC.
The nodes-mapping for 1-1 restore defines which node is going to restore data from which DC.
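For illustration, a single mapping entry could look roughly like this, reusing the node struct from the diff above (the Source/Target field names and all values are assumptions):

```go
// Hypothetical example only: maps one backed-up node to one live node.
m := nodeMapping{
	Source: node{DC: "dc1", Rack: "rack1", HostID: "source-host-id"},
	Target: node{DC: "dc1", Rack: "rack1", HostID: "target-host-id"},
}
```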


Comment on lines 58 to 63
	if len(allManifests) != nodesWithAccessCount || len(allManifests) != len(nodeStatus) {
		return nil, nil, fmt.Errorf("manifest count (%d) != target nodes (%d)", len(allManifests), len(nodesCountSet.List()))
	}

	return allManifests, nodesToHosts(nodeStatus), nil
}
Collaborator left a comment:

This logic will be completely different when you iterate over nodes instead of locations.

After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.

@VAveryanov8 VAveryanov8 (Collaborator, Author) commented on Feb 5, 2025:

> After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.

That's exactly what happens inside the validateCluster method.

A few words in general about my implementation and the reasoning behind it:
In a nutshell, we have a source cluster (the backup), whose nodes are represented by manifest files in the backup location(s), and a target cluster, whose nodes are the actual live nodes of the cluster we want to restore data to.

  1. Get the source cluster nodes and the target cluster nodes (getManifestsAndHosts).
  2. Collect the information needed to compare source and target cluster nodes (collectNodeValidationInfo); here I iterate over nodes.
  3. Compare them, using the node mapping as the rule for matching a source cluster node to a target cluster node (checkOne2OneRestoreCompatibility).

If validation passes successfully, I can use the node info from step 1 further in the code, as I know it's valid for 1-1-restore and each node has an exact match.
Keeping the logic this way gives us the ability to keep the validation logic in one place, without spreading it across other parts of the code.

> This logic will be completely different when you iterate over nodes instead of locations.

Here is how I see that logic (a rough sketch follows after this list):

  1. Find the node's DC by looking at the node mappings.
  2. Find the corresponding location by checking location.DC against the DC from step 1.
  3. Download the manifest content. This is actually two steps, list the dir and then download, because the manifest path contains the taskID which we don't know (or do we?).
  4. Collect additional node info (token ring, etc.).
  5. Compare the node info with the manifest content.

Without getting into the details, the main difference between the two solutions is that in the first we collect all the info and then do the validation, while in the second we validate the nodes one by one.

For me it's more or less the same, but if you prefer the second over the first, let me know and I'll change this PR.
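A rough sketch of that node-by-node flow (every type and helper below is hypothetical; the manifest download from step 3 is abstracted behind a callback):

```go
package pernodesketch

import (
	"context"
	"fmt"
)

// sketchTargetNode is a live node plus the source DC resolved from the nodes-mapping (step 1).
type sketchTargetNode struct {
	HostID   string
	SourceDC string
}

// sketchLocation is a backup location restricted to one DC (an empty DC would mean all DCs).
type sketchLocation struct {
	DC   string
	Path string
}

// fetchManifest stands in for "list the backup dir, then download the manifest" (step 3).
type fetchManifest func(ctx context.Context, loc sketchLocation, hostID string) (shards int64, tokens []int64, err error)

// validatePerNode validates target nodes one by one instead of iterating over locations.
func validatePerNode(ctx context.Context, nodes []sketchTargetNode, locations []sketchLocation, fetch fetchManifest) error {
	for _, n := range nodes {
		// Step 2: find the location that keeps the SSTables of this node's source DC.
		var loc *sketchLocation
		for i := range locations {
			if locations[i].DC == n.SourceDC || locations[i].DC == "" {
				loc = &locations[i]
				break
			}
		}
		if loc == nil {
			return fmt.Errorf("no location holds SSTables for DC %q (target node %s)", n.SourceDC, n.HostID)
		}
		// Steps 3-5: download the manifest and compare shard count and token ring
		// against what the live node reports (the comparison itself is elided here).
		if _, _, err := fetch(ctx, *loc, n.HostID); err != nil {
			return fmt.Errorf("target node %s cannot read manifest from %s: %w", n.HostID, loc.Path, err)
		}
	}
	return nil
}
```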


Comment on lines +12 to +23
func TestMapTargetHostToSource(t *testing.T) {
	testCases := []struct {
		name string

		nodeMappings []nodeMapping
		targetHosts  []Host
		expected     map[string]Host
		expectedErr  bool
	}{
		{
			name: "All hosts have mappings",
			nodeMappings: []nodeMapping{
Collaborator left a comment:

These tests are OK.
But you should verify this in a bit broader way.
Please keep validateClusters as the method you test. Most likely that will require integration tests instead of these ones.
