feat(1-1-restore): validates if source and target cluster nodes are equal #4230
base: va/fast-restore-part-2
Conversation
Force-pushed from 37409c0 to 7e9f877
…equal

This adds a validation stage for 1-1-restore. The logic is as follows:
- Collect node information for the source cluster from backup manifests
- Collect node information for the target cluster from the Scylla API
- Apply node mappings to the source cluster nodes
- Compare each source node with its corresponding target node

Fixes: #4201
This changes the following parts of the validation process:
- Moves path.Join("backup", string(MetaDirKind)) to the `backupspec` pkg
- Moves getManifestContext to worker_manifest
- Adds SourceClusterID validation to getManifestInfo
- Simplifies how node info is collected by leveraging node mappings (maps manifests to nodes by host id)
- Replaces the LocationInfo struct with manifests and hosts
- Sorts node tokens
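For orientation, here is a minimal, self-contained sketch of the last two bullets above (apply the node mappings, then compare each source/target pair). The type shapes and the mapping-by-host-ID resolution are assumptions based on snippets quoted later in this review, not the PR's actual code:

```go
package one2onerestore

import (
	"fmt"
	"slices"
)

// Minimal stand-ins for the types quoted later in this review; field sets are
// illustrative, not copied from the PR.
type nodeValidationInfo struct {
	DC         string
	Rack       string
	HostID     string
	ShardCount int64
	Tokens     []int64
}

// nodeMapping is assumed to pair a source host ID with a target host ID; the
// real type likely carries DC and rack as well.
type nodeMapping struct {
	Source string
	Target string
}

// checkOne2OneRestoreCompatibility pairs each source node with its mapped
// target node and requires the attributes relevant for 1-1 restore to match.
func checkOne2OneRestoreCompatibility(source, target []nodeValidationInfo, mappings []nodeMapping) error {
	targetIDBySourceID := make(map[string]string, len(mappings))
	for _, m := range mappings {
		targetIDBySourceID[m.Source] = m.Target
	}
	targetByHostID := make(map[string]nodeValidationInfo, len(target))
	for _, t := range target {
		targetByHostID[t.HostID] = t
	}
	for _, s := range source {
		targetID, ok := targetIDBySourceID[s.HostID]
		if !ok {
			return fmt.Errorf("no mapping for source node %s", s.HostID)
		}
		t, ok := targetByHostID[targetID]
		if !ok {
			return fmt.Errorf("mapped target node %s not found in target cluster", targetID)
		}
		// Tokens are assumed to be sorted (the PR sorts node tokens).
		if s.ShardCount != t.ShardCount || !slices.Equal(s.Tokens, t.Tokens) {
			return fmt.Errorf("source node %s and target node %s are not equal", s.HostID, t.HostID)
		}
	}
	return nil
}
```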
Force-pushed from 022de10 to d7ad144
The validation process looks way cleaner now:)
	nodesCountSet = strset.New()
)
for _, location := range target.Location {
	// Ignore location.DC because all mappings should be specified via nodes-mapping file
That's a good direction, but then we shouldn't allow the user to specify the location DC in the first place.
Or we can still treat it as a hint if it simplifies the implementation, although I'm not sure about that.
Guys, why not invert the logic here?
Instead of iterating over locations, iterate over the target nodes and check that each of them can access the location keeping its expected SSTables? It must have that access.
Actually, please ensure that each node has access to the corresponding location.
pkg/service/one2onerestore/worker.go (outdated)
	}
}

nodesWithAccessCount := len(nodesCountSet.List())
nit: you can use nodesCountSet.Size()
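For reference, assuming nodesCountSet is the strset from github.com/scylladb/go-set, Size() returns the count directly:

```go
package main

import (
	"fmt"

	"github.com/scylladb/go-set/strset"
)

func main() {
	nodesCountSet := strset.New("node-1", "node-2", "node-3")

	// Both print 3; Size() avoids allocating the intermediate slice that List() builds.
	fmt.Println(len(nodesCountSet.List()))
	fmt.Println(nodesCountSet.Size())
}
```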
	return nil
}

func checkOne2OneRestoreCompatiblity(sourceNodeInfo, targetNodeInfo []nodeValidationInfo, nodeMappings []nodeMapping) error {
typo: checkOne2OneRestoreCompatiblity -> checkOne2OneRestoreCompatibility
	DC       string
	Rack     string
	HostID   string
	CPUCount int64
Shouldn't it be shard count instead of cpu count?
Agree, it should be shards.
CPU count is the number of CPUs available on the machine, while shards define how many CPUs Scylla actually uses.
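A possible shape of the struct after that change; only ShardCount replaces CPUCount, and Tokens is an assumption based on the token-ring checks mentioned elsewhere in this PR:

```go
// Illustrative only; the field set follows the diff above, not the final code.
type nodeValidationInfo struct {
	DC         string
	Rack       string
	HostID     string
	ShardCount int64   // number of shards Scylla actually uses, not machine CPUs
	Tokens     []int64 // assumed: sorted token ring, per the "Sorts node tokens" change
}
```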
A few comments.
One is about changing the logic to iterate over target nodes instead of locations; that one is the most important for me.
pkg/service/one2onerestore/model.go (outdated)
@@ -28,10 +28,17 @@ type nodeMapping struct {

 type node struct {
 	DC   string `json:"dc"`
-	Rack string `json:"rack"`
+	Rack string `json:"rack_id"`
It's a rack name, not an ID.
Example: https://github.com/scylladb/scylladb/blob/master/conf/cassandra-rackdc.properties
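If the field really carries a rack name, the tag could simply stay rack; a sketch of the suggested shape (the reviewer's suggestion, not the final code):

```go
// Illustrative only.
type node struct {
	DC   string `json:"dc"`
	Rack string `json:"rack"` // rack name, e.g. "RAC1" from cassandra-rackdc.properties, not an ID
}
```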
	logger log.Logger
}

// getManifestsAndHosts checks that each host in the target cluster has access to at least one target location and fetches manifest info.
This comment is a bit confusing.
Each node must have access to the location where its SSTables are stored.
1-1 restore provides a mapping between a source node and the target node.
The location defines the DC. If the DC is empty, it means (or should mean) that all DCs are there.
See https://manager.docs.scylladb.com/stable/backup#meta
The method must check that all nodes expected to restore a particular DC have access to the location which keeps the SSTables of this DC.
The nodes-mapping for 1-1 restore defines which node is going to restore data from which DC.
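A rough sketch of that rule, with hypothetical helper names (sourceDCFor, checkAccess) and the Host/Location/nodeMapping types assumed from the PR; the only grounded parts are the rule itself and the convention that an empty Location.DC means the location keeps all DCs:

```go
// Every target node must reach the location that stores the SSTables of the DC
// it is mapped to restore.
func checkLocationAccess(ctx context.Context, nodes []Host, mappings []nodeMapping, locations []Location) error {
	for _, n := range nodes {
		dc := sourceDCFor(n, mappings) // hypothetical: source DC this node restores, from nodes-mapping
		loc, ok := locationForDC(locations, dc)
		if !ok {
			return fmt.Errorf("no location keeps SSTables of DC %q needed by node %s", dc, n.ID)
		}
		if err := checkAccess(ctx, n, loc); err != nil { // hypothetical: probe location access from the node
			return fmt.Errorf("node %s cannot access location %s: %w", n.ID, loc, err)
		}
	}
	return nil
}

// locationForDC picks the location holding the given DC; an empty DC on a
// location is treated as "keeps all DCs" (see the backup docs linked above).
func locationForDC(locations []Location, dc string) (Location, bool) {
	for _, l := range locations {
		if l.DC == "" || l.DC == dc {
			return l, true
		}
	}
	return Location{}, false
}
```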
pkg/service/one2onerestore/worker.go (outdated)
	if len(allManifests) != nodesWithAccessCount || len(allManifests) != len(nodeStatus) {
		return nil, nil, fmt.Errorf("manifest count (%d) != target nodes (%d)", len(allManifests), len(nodesCountSet.List()))
	}

	return allManifests, nodesToHosts(nodeStatus), nil
}
This logic will be completely different when you iterate over nodes instead of locations.
After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.
> After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.

That's exactly what happens inside the validateCluster method.
A few words in general about my implementation and the reasoning behind it:
In a nutshell, we have a source cluster (the backup), whose nodes are represented by manifest files in the backup location(s), and a target cluster, whose nodes are the actual live nodes of the cluster we want to restore data to.
- Get source cluster nodes and target cluster nodes. (getManifestsAndHosts)
- Collect the information needed to compare source and target cluster nodes. (collectNodeValidationInfo) (here I iterate over nodes)
- Compare them, using the node mapping as the rule for matching a source cluster node with a target cluster node. (checkOne2OneRestoreCompatibility)
If validation passes successfully, I can use the node info from step 1 further in the code, as I know it's valid for 1-1-restore and each node has an exact match.
Keeping the logic this way gives us the ability to keep the validation logic in one place, without spreading it across other parts of the code.
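A compact sketch of that three-step shape, reusing the function names from the description above (the receiver and exact signatures are guesses, not the PR's code):

```go
// Validation kept in one place: gather everything first, then compare.
func (w *worker) validateClusters(ctx context.Context, target Target) error {
	// Step 1: source nodes (backup manifests) and target nodes (live cluster).
	manifests, hosts, err := w.getManifestsAndHosts(ctx, target)
	if err != nil {
		return fmt.Errorf("get manifests and hosts: %w", err)
	}
	// Step 2: collect everything needed for the comparison.
	sourceInfo, targetInfo, err := w.collectNodeValidationInfo(ctx, manifests, hosts)
	if err != nil {
		return fmt.Errorf("collect node validation info: %w", err)
	}
	// Step 3: compare, using node mappings to pair source and target nodes.
	return checkOne2OneRestoreCompatibility(sourceInfo, targetInfo, target.NodeMappings)
}
```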
> This logic will be completely different when you iterate over nodes instead of locations.
Here is how I see this logic:
- Find the node's DC by looking at the node mappings
- Find the corresponding location by checking location.DC against the DC from step 1
- Download the manifest content. Two steps actually: list the dir and then download, because the manifest path contains the taskID, which we don't know (or do we?)
- Collect additional node info (token ring, etc.)
- Compare the node info with the manifest content.
Without getting into the details, the main difference between the two solutions is that in the first we collect all the info and then do the validation, while in the second we validate nodes one by one.
For me it's more or less the same, but if you prefer the second over the first, let me know and I'll change this PR.
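For contrast, the per-node variant from those steps might look roughly like this; every helper name here is hypothetical, and the manifest listing/downloading is only indicated, not implemented:

```go
// Alternative: validate target nodes one by one instead of collecting
// everything up front.
func (w *worker) validateNodeByNode(ctx context.Context, target Target, hosts []Host) error {
	for _, h := range hosts {
		dc := sourceDCFor(h, target.NodeMappings)     // step 1: DC from the node mappings
		loc, ok := locationForDC(target.Location, dc) // step 2: location keeping that DC
		if !ok {
			return fmt.Errorf("no location for DC %q (node %s)", dc, h.ID)
		}
		m, err := w.findAndDownloadManifest(ctx, loc, h) // step 3: list dir, then download
		if err != nil {
			return fmt.Errorf("manifest for node %s: %w", h.ID, err)
		}
		info, err := w.nodeInfo(ctx, h) // step 4: shards, token ring, ...
		if err != nil {
			return fmt.Errorf("node info for %s: %w", h.ID, err)
		}
		if err := compareManifestWithNode(m, info); err != nil { // step 5: compare
			return err
		}
	}
	return nil
}
```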
func TestMapTargetHostToSource(t *testing.T) {
	testCases := []struct {
		name string

		nodeMappings []nodeMapping
		targetHosts  []Host
		expected     map[string]Host
		expectedErr  bool
	}{
		{
			name: "All hosts have mappings",
			nodeMappings: []nodeMapping{
These tests are OK, but you should verify it in a broader way.
Please keep validateClusters as the method you test. Most likely that will require integration tests instead of these.
This replaces CPUCount with ShardCount for the cluster comparison. Fixes a typo in the function name.
This adds a validation stage for 1-1-restore. The logic is as follows:
- Collect node information for the source cluster from backup manifests
- Collect node information for the target cluster from the Scylla API
- Apply node mappings to the source cluster nodes
- Compare each source node with its corresponding target node

Fixes: #4201