feat(1-1-restore): validates if source and target cluster nodes are equal #4230

Open
wants to merge 8 commits into base: va/fast-restore-part-2
Conversation

VAveryanov8 (Collaborator)

This adds a validation stage for 1-1-restore. The logic is as follows:

  • Collect node information for the source cluster from backup manifests
  • Collect node information for the target cluster from the Scylla API
  • Apply node mappings to the source cluster nodes
  • Compare each source node with its corresponding target node

Fixes: #4201
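For context, a minimal sketch of what the final comparison stage could look like; every type and function name below is illustrative, not the PR's actual code:

```go
package sketch

import (
	"fmt"
	"slices"
)

// sketchNode carries the per-node facts the validation compares
// (DC, rack, host ID, shard count, token ring); the exact fields are an assumption.
type sketchNode struct {
	DC, Rack, HostID string
	ShardCount       int64
	Tokens           []int64 // assumed to be sorted before comparison
}

// compareMappedNodes checks every source node (built from backup manifests and already
// renamed via the nodes-mapping) against its target node (reported by the Scylla API).
func compareMappedNodes(source, target map[string]sketchNode) error {
	for hostID, s := range source {
		t, ok := target[hostID]
		if !ok {
			return fmt.Errorf("target cluster has no node matching source node %s", hostID)
		}
		if s.DC != t.DC || s.Rack != t.Rack || s.ShardCount != t.ShardCount || !slices.Equal(s.Tokens, t.Tokens) {
			return fmt.Errorf("source node %s and target node %s are not 1-1 compatible", hostID, t.HostID)
		}
	}
	return nil
}
```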


Please make sure that:

  • Code is split into commits that each address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 marked this pull request as ready for review January 27, 2025 17:30
@VAveryanov8 VAveryanov8 force-pushed the va/fast-restore-part-2 branch from 37409c0 to 7e9f877 on January 29, 2025 14:35
…equal

This adds a validation stage for 1-1-restore. The logic is as follows:
- Collect node information for the source cluster from backup manifests
- Collect node information for the target cluster from the Scylla API
- Apply node mappings to the source cluster nodes
- Compare each source node with its corresponding target node

Fixes: #4201
This changes the following parts of the validation process:
- Moves path.Join("backup", string(MetaDirKind)) to the `backupspec` pkg (see the sketch below)
- Moves getManifestContext to worker_manifest
- Adds SourceClusterID validation to getManifestInfo
- Simplifies how node info is collected by leveraging node mappings (maps manifests to nodes by host ID)
- Replaces the LocationInfo struct with manifests and hosts
- Sorts node tokens
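For the first item above, a sketch of what the moved helper boils down to; the function name and the DirKind stub are assumptions, only the path.Join("backup", string(MetaDirKind)) expression comes from the commit message:

```go
package backupspecsketch

import "path"

// DirKind and MetaDirKind stand in for the existing backupspec identifiers.
type DirKind string

const MetaDirKind DirKind = "meta"

// MetaBaseDir returns the base backup directory that holds the manifests.
func MetaBaseDir() string {
	return path.Join("backup", string(MetaDirKind))
}
```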
@Michal-Leszczynski Michal-Leszczynski (Collaborator) left a comment:

The validation process looks way cleaner now:)

nodesCountSet = strset.New()
)
for _, location := range target.Location {
// Ignore location.DC because all mappings should be specified via nodes-mapping file
Collaborator left a comment:

That's a good direction, but then we shouldn't allow the user to specify the location DC in the first place.
Or we can still treat it as a hint if it simplifies the implementation, although I'm not sure about that.

Collaborator left a comment:

Guys, why not invert the logic here?
Instead of iterating over locations, iterate over target nodes to check whether they can access the location keeping their expected SSTables? Each node must have that access.

Actually, please ensure that each node has access to the corresponding location.

}
}

nodesWithAccessCount := len(nodesCountSet.List())
Collaborator left a comment:

nit: you can use nodesCountSet.Size().
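For reference, that swap would look roughly like this (a sketch, assuming nodesCountSet is the strset set created above):

```go
// Size() returns the element count directly, avoiding the slice allocated by List().
nodesWithAccessCount := nodesCountSet.Size()
```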

return nil
}

func checkOne2OneRestoreCompatiblity(sourceNodeInfo, targetNodeInfo []nodeValidationInfo, nodeMappings []nodeMapping) error {
Collaborator left a comment:

typo: checkOne2OneRestoreCompatiblity -> checkOne2OneRestoreCompatibility

DC string
Rack string
HostID string
CPUCount int64
Collaborator left a comment:

Shouldn't it be shard count instead of cpu count?

Collaborator left a comment:

Agreed, it should be shards.
The CPU count is the number of CPUs available on the machine.
The shard count defines how many CPUs Scylla actually uses.
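In other words, the suggestion amounts to renaming the field along these lines (a sketch based on the nodeValidationInfo fields quoted above; the exact change is up to the PR):

```go
type nodeValidationInfo struct {
	DC         string
	Rack       string
	HostID     string
	ShardCount int64 // shards Scylla actually runs with, rather than the machine's CPU count
}
```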

@karol-kokoszka karol-kokoszka (Collaborator) left a comment:

A few comments.
One is about changing the logic to iterate over target nodes instead of locations; this is the most important one for me.

@@ -28,10 +28,17 @@ type nodeMapping struct {

 type node struct {
 	DC   string `json:"dc"`
-	Rack string `json:"rack"`
+	Rack string `json:"rack_id"`
logger log.Logger
}

// getManifestsAndHosts checks that each host in target cluster should have an access to at least one target location and fetches manifest info.
Collaborator left a comment:

This comment is a bit confusing.
Each node must have access to the location where its SSTables are stored.

1-1 restore provides a mapping between a source node and a target node.

The location defines the DC. If the DC is empty, it means (or should mean) that all DCs are there.
See https://manager.docs.scylladb.com/stable/backup#meta

The method must check that all nodes expected to restore a particular DC have access to the location that keeps the SSTables of that DC.
The nodes-mapping for 1-1 restore defines which node is going to restore data from which DC.
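For illustration, a single mapping entry could look roughly like this, reusing the node struct from the diff above (the Source/Target field names and all values are assumptions):

```go
// Hypothetical example only: maps one backed-up node to one live node.
m := nodeMapping{
	Source: node{DC: "dc1", Rack: "rack1", HostID: "source-host-id"},
	Target: node{DC: "dc1", Rack: "rack1", HostID: "target-host-id"},
}
```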


Comment on lines 58 to 63
	if len(allManifests) != nodesWithAccessCount || len(allManifests) != len(nodeStatus) {
		return nil, nil, fmt.Errorf("manifest count (%d) != target nodes (%d)", len(allManifests), len(nodesCountSet.List()))
	}

	return allManifests, nodesToHosts(nodeStatus), nil
}
Collaborator left a comment:

This logic will be completely different when you iterate over nodes instead of locations.

After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.

@VAveryanov8 VAveryanov8 (Collaborator, Author) commented on Feb 5, 2025:

> After ensuring that the node has access to the location, you just download the corresponding manifest to check details like the number of shards and the token ring.

That's exactly what happens inside the validateCluster method.

A few words in general about my implementation and the reasoning behind it:
In a nutshell, we have a source cluster (the backup), whose nodes are represented by manifest files in the backup location(s), and a target cluster, whose nodes are the actual live nodes of the cluster we want to restore data to.

  1. Get the source cluster nodes and the target cluster nodes (getManifestsAndHosts).
  2. Collect the information needed to compare source and target cluster nodes (collectNodeValidationInfo); here I iterate over nodes.
  3. Compare them, using the node mapping as the rule for matching a source cluster node to a target cluster node (checkOne2OneRestoreCompatibility).

If validation passes successfully, I can use the node info from step 1 further in the code, as I know it's valid for 1-1-restore and each node has an exact match.
Keeping the logic this way gives us the ability to keep the validation logic in one place, without spreading it across other parts of the code.

> This logic will be completely different when you iterate over nodes instead of locations.

Here is how I see that logic (a rough sketch follows after this list):

  1. Find the node's DC by looking at the node mappings.
  2. Find the corresponding location by checking location.DC against the DC from step 1.
  3. Download the manifest content. This is actually two steps, list the dir and then download, because the manifest path contains the taskID which we don't know (or do we?).
  4. Collect additional node info (token ring, etc.).
  5. Compare the node info with the manifest content.

Without getting into the details, the main difference between the two solutions is that in the first we collect all the info and then do the validation, while in the second we validate the nodes one by one.

For me it's more or less the same, but if you prefer the second over the first, let me know and I'll change this PR.
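A rough sketch of that node-by-node flow (every type and helper below is hypothetical; the manifest download from step 3 is abstracted behind a callback):

```go
package pernodesketch

import (
	"context"
	"fmt"
)

// sketchTargetNode is a live node plus the source DC resolved from the nodes-mapping (step 1).
type sketchTargetNode struct {
	HostID   string
	SourceDC string
}

// sketchLocation is a backup location restricted to one DC (an empty DC would mean all DCs).
type sketchLocation struct {
	DC   string
	Path string
}

// fetchManifest stands in for "list the backup dir, then download the manifest" (step 3).
type fetchManifest func(ctx context.Context, loc sketchLocation, hostID string) (shards int64, tokens []int64, err error)

// validatePerNode validates target nodes one by one instead of iterating over locations.
func validatePerNode(ctx context.Context, nodes []sketchTargetNode, locations []sketchLocation, fetch fetchManifest) error {
	for _, n := range nodes {
		// Step 2: find the location that keeps the SSTables of this node's source DC.
		var loc *sketchLocation
		for i := range locations {
			if locations[i].DC == n.SourceDC || locations[i].DC == "" {
				loc = &locations[i]
				break
			}
		}
		if loc == nil {
			return fmt.Errorf("no location holds SSTables for DC %q (target node %s)", n.SourceDC, n.HostID)
		}
		// Steps 3-5: download the manifest and compare shard count and token ring
		// against what the live node reports (the comparison itself is elided here).
		if _, _, err := fetch(ctx, *loc, n.HostID); err != nil {
			return fmt.Errorf("target node %s cannot read manifest from %s: %w", n.HostID, loc.Path, err)
		}
	}
	return nil
}
```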


Comment on lines +12 to +23
func TestMapTargetHostToSource(t *testing.T) {
	testCases := []struct {
		name string

		nodeMappings []nodeMapping
		targetHosts  []Host
		expected     map[string]Host
		expectedErr  bool
	}{
		{
			name: "All hosts have mappings",
			nodeMappings: []nodeMapping{
Collaborator left a comment:

These tests are OK.
But you should verify this in a bit broader way.
Please keep validateClusters as the method you test. Most likely that will require integration tests instead of these ones.
