CephFS: prevent hanging NodeGetVolumeStats on stat() syscall when an MDS is slow #4200
Conversation
bbeced9 to 4e8007a
internal/cephfs/nodeserver.go
Outdated
	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
	log.WarningLog(ctx, "failed to start healtchecker: %v", err)
What if the volume is already mounted (the case where the CSI plugin just restarted)?
In that case, there is no health-checker. Possibly `NodeGetVolumeStats` can be extended to start the checker to catch this.
I think we need this, can we have a tracker for this one?
yes, makes sense. #4219 has been created for this.
internal/cephfs/nodeserver.go
Outdated
@@ -599,6 +606,9 @@ func (ns *NodeServer) NodeUnstageVolume(
	}

	volID := req.GetVolumeId()

	ns.healthChecker.StopChecker(volID)
This should happen right before umount, but after all other operations, right?
It should be done early, yes. Unmounting may not be possible if the checker reads from or writes to the mountpoint at the same time.
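For reference, a minimal, self-contained sketch of that ordering (not the PR's actual `NodeUnstageVolume` code; the stop/unmount helpers are stand-ins): the checker is stopped first so nothing touches the mountpoint while the unmount runs.

```go
package main

import "fmt"

// unstage stops the health checker before unmounting, so the checker is not
// reading/writing under the mountpoint while the unmount is attempted.
func unstage(stopChecker func(), unmount func() error) error {
	stopChecker()

	return unmount()
}

func main() {
	err := unstage(
		func() { fmt.Println("health checker stopped") },
		func() error { fmt.Println("staging path unmounted"); return nil },
	)
	fmt.Println("NodeUnstageVolume finished, err =", err)
}
```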
internal/cephfs/nodeserver.go
Outdated
@@ -694,6 +711,18 @@ func (ns *NodeServer) NodeGetVolumeStats(
		return nil, status.Error(codes.InvalidArgument, err.Error())
	}

	// health check first, return without stats if unhealthy
	healthy, msg := ns.healthChecker.IsHealthy(req.GetVolumeId())
What happens if `NodeGetVolumeStats` is called right after `NodeStageVolume`?
Before the 1st check has run, the volume is regarded as healthy. The assumption is that when mounting works, the volume is still healthy shortly after.
internal/cephfs/nodeserver.go
Outdated
	if err != nil {
		return nil, err
	}

	return res, nil
We don't need this check and can have `return csicommon.FilesystemNodeGetVolumeStats(ctx, ns.Mounter, targetPath, false)` as it was before.
Ah, yes. It was modified in an earlier version to set `res.VolumeCondition = ...` after getting the stats. That wasn't my smartest move, as getting the stats got hung, so the `VolumeCondition` was never set and returned 🤦‍♂️
Done.
// command is what is sent through the channel to terminate the go routine.
type command string

const (
	// stopCommand is sent through the channel to stop checking.
	stopCommand = command("STOP")
)
Can we make use of an empty struct to signal between go routines?
That is possible, but what is the advantage? I've chosen this way to make it extendable if needed; it does not matter at the moment what is sent over the channel.
In most cases empty structs are preferred for signalling between go routines, that's why the suggestion.
Is that a recommendation that is documented anywhere? I have seen several examples with integers and booleans, or structs that contain one or more members...
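For reference, a minimal, self-contained sketch of the suggested pattern (not code from this PR): a `chan struct{}` carries no data, and closing it broadcasts the stop request to the checker goroutine.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// stop is a signal-only channel: struct{}{} carries no payload, and
	// closing the channel tells the goroutine to stop.
	stop := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-stop:
				fmt.Println("checker stopped")
				return
			case <-ticker.C:
				fmt.Println("running periodic check")
			}
		}
	}()

	time.Sleep(350 * time.Millisecond)
	close(stop)
	<-done // wait for the goroutine to exit
}
```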
	fc.healthy = false
	fc.err = fmt.Errorf("health-check has not responded for %f seconds", delay.Seconds())
Will there be any problem with multiple goroutines trying to update the same variables?
Not really. There is a very small race in case one go-routine does not enter this if-statement, and another does and updates `fc.err` before the 1st go-routine exits the function. I don't think it is a realistic failure scenario to account for.
The same can happen with this function and a concurrent `runChecker()`. But that also looks like quite an edge case.
If you insist, I can add some locking around updating/reading `fc.healthy` and `fc.err`.
Locking has been added.
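A minimal, self-contained sketch of that kind of locking (illustrative names such as `checker`, `setUnhealthy` and `isHealthy`, not the PR's actual types): a mutex guards the shared `healthy`/`err` fields so the checker goroutine and callers do not race.

```go
package main

import (
	"fmt"
	"sync"
)

// checker guards its status fields with a RWMutex so the goroutine that
// updates the status and the callers that read it do not race.
type checker struct {
	mu      sync.RWMutex
	healthy bool
	err     error
}

func (c *checker) setUnhealthy(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.healthy = false
	c.err = err
}

func (c *checker) isHealthy() (bool, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.healthy, c.err
}

func main() {
	c := &checker{healthy: true}
	c.setUnhealthy(fmt.Errorf("health-check has not responded for 30 seconds"))

	healthy, err := c.isHealthy()
	fmt.Println(healthy, err)
}
```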
return fmt.Errorf("failed to created workdir %q for health-checker: %w", workdir, err) | ||
} | ||
|
||
cc := newFileChecker(workdir) |
This always starts a FileChecker; what about a BlockChecker for the future?
When a `BlockChecker` is introduced, the `Manager` interface will need some modifications in any case.
	} else {
		// 'cc' was stored, start it only once
		cc.start()
	}
Can this happen only for a static PVC? Because `NodeStageVolume` will be called only once per volumeID.
`NodeStageVolume` might have failed/timed out after the checker was started. Calling `NodeStageVolume` again should be a no-op in that case; this just ensures some idempotency.
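A minimal, self-contained sketch of that idempotency (illustrative only; the real manager stores `ConditionChecker` values, not the stand-in type used here): `sync.Map.LoadOrStore` makes a repeated NodeStageVolume reuse the checker that an earlier call already registered.

```go
package main

import (
	"fmt"
	"sync"
)

var checkers sync.Map // volumeID -> checker

type fakeChecker struct {
	path    string
	started bool
}

// startChecker registers a checker for the volume, or reuses the one that an
// earlier (possibly failed or timed-out) NodeStageVolume call stored already.
func startChecker(volumeID, path string) *fakeChecker {
	cc := &fakeChecker{path: path}

	old, loaded := checkers.LoadOrStore(volumeID, cc)
	if loaded {
		// a checker exists already; do not start a second one
		return old.(*fakeChecker)
	}

	cc.started = true

	return cc
}

func main() {
	first := startChecker("vol-1", "/staging/vol-1/.csi")
	second := startChecker("vol-1", "/staging/vol-1/.csi")
	fmt.Println(first == second) // true: the same checker is reused
}
```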
internal/health-checker/manager.go
Outdated
	cc := newFileChecker(workdir)

	// load the 'old' ConditionChecker if it exists, otherwuse store 'cc'
`otherwuse` to `otherwise`
@Madhu-1 and @riya-singhal31, I have manually tested this again (by scaling MDSs down/up) and it still works as before. Could you review again and mark the addressed comments as resolved? Many thanks!
internal/cephfs/nodeserver.go
Outdated
@@ -270,6 +272,9 @@ func (ns *NodeServer) NodeStageVolume(
		}
	}

	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
	log.WarningLog(ctx, "failed to start healtchecker: %v", err)
log.WarningLog(ctx, "failed to start healtchecker: %v", err) | |
log.WarningLog(ctx, "failed to start healthchecker: %v", err) |
Done.
Left some questions, mostly LGTM.
}

func (hcm *healthCheckManager) StartChecker(volumeID, path string) error {
	workdir := filepath.Join(path, ".csi")
What if the PVC already has this path, or the application doesn't expect any folders inside the mounted directory?
Other filesystems also maintain their own metadata in the root, like `lost+found` on ext4. Applications that (do not) expect certain filesystem-internal metadata should be quite rare, as they are not very portable.
Would you prefer to have an option per volume to use a different directory, or to disable it?
Yes, I agree on that one. This is something we are introducing, but not all application owners/developers will be aware of it. IMHO we should not write or touch anything in the path that is provided to the application pods. I would prefer the below:
- Mount the subvolume to the stagingPath, create a new directory as in the design doc, and mount that directory to the targetPath
- Do the health check only inside the stagingPath, which is not available to the application pod
For legacy volumes, we could also do a minimal `stat()` on the root of the volume, instead of writing/reading a timestamp. It might not be able to detect all failure scenarios, but it would be a non-intrusive check.
Newly created volumes can have their own `data` subdirectory that is mapped for use inside of the container, leaving the root of the volume for contents that are not visible to applications. That can be done in PR #4125 once this one is accepted.
> For legacy volumes, we could also do a minimal stat() on the root of the volume, instead of writing/reading a timestamp. It might not be able to detect all failure scenarios, but it would be a non-intrusive check.

It's a temporary problem; once the node gets rebooted or the application gets restaged, we will move to the new format. We can live with it.

> Newly created volumes can have their own data subdirectory that is mapped for use inside of the container, leaving the root of the volume for contents that are not visible to applications. That can be done in PR #4125 once this one is accepted.

Sounds good 👍🏻
Added `statChecker` and will use that in the nodeserver for now. `fileChecker` is still included so that the refactoring with a base `checker` makes sense.
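A minimal, self-contained sketch of what a stat-based check can look like (an illustration, not the PR's actual `statChecker` implementation): the `stat()` runs in its own goroutine so the checker can declare the volume unhealthy after a timeout instead of hanging along with a slow MDS.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// timedStat runs stat() in a separate goroutine so the caller can give up
// after a timeout instead of hanging when the filesystem does not respond.
func timedStat(path string, timeout time.Duration) error {
	done := make(chan error, 1)

	go func() {
		// if stat() never returns, this goroutine lingers until the syscall
		// completes, but the caller is no longer blocked on it
		_, err := os.Stat(path)
		done <- err
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("stat on %q did not respond within %v", path, timeout)
	}
}

func main() {
	if err := timedStat("/tmp", 5*time.Second); err != nil {
		fmt.Println("volume is unhealthy:", err)

		return
	}

	fmt.Println("volume is healthy")
}
```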
internal/cephfs/nodeserver.go
Outdated
@@ -270,6 +272,9 @@ func (ns *NodeServer) NodeStageVolume(
		}
	}

	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
The same check needs to be done at line 233 where we are returning success if the directory is already mounted
internal/cephfs/nodeserver.go
Outdated
	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
	log.WarningLog(ctx, "failed to start healtchecker: %v", err)
I think we need this, can we have a tracker for this one?
internal/cephfs/nodeserver.go
Outdated
@@ -270,6 +275,9 @@ func (ns *NodeServer) NodeStageVolume(
		}
	}

	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
	log.WarningLog(ctx, "failed to start healthchecker: %v", err)
This should be done inside the `if err != nil` check.
Done!
internal/cephfs/nodeserver.go
Outdated
@@ -228,6 +230,9 @@ func (ns *NodeServer) NodeStageVolume(
		return nil, status.Error(codes.Internal, err.Error())
	}

	err = ns.healthChecker.StartChecker(req.GetVolumeId(), stagingTargetPath)
	log.WarningLog(ctx, "failed to start healthchecker: %v", err)
This should be done inside the `if err != nil` check.
Done!
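For reference, a minimal, self-contained sketch of the requested change (illustrative only, using the standard `log` package rather than the project's logging helpers, and a stand-in `startChecker` function): the warning is emitted only when starting the checker actually failed.

```go
package main

import (
	"errors"
	"log"
)

// startChecker is a stand-in for ns.healthChecker.StartChecker().
func startChecker(stagingTargetPath string) error {
	if stagingTargetPath == "" {
		return errors.New("empty staging target path")
	}

	return nil
}

func main() {
	// log the warning only when StartChecker failed; an unconditional log
	// would print "failed to start healthchecker: <nil>" on every success.
	if err := startChecker("/var/lib/kubelet/plugins/staging/vol-1"); err != nil {
		log.Printf("failed to start healthchecker: %v", err)
	}

	log.Print("NodeStageVolume continues")
}
```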
internal/health-checker/manager.go
Outdated
	// load the 'old' ConditionChecker if it exists
	old, ok := hcm.checkers.Load(volumeID)
	if !ok {
		return true, fmt.Errorf("no ConditionChecker for volume-id: %s", volumeID)
This error will not be checked, as we are returning the volume as healthy along with the error message.
Added a check for that now, will start a checker in that case too.
	// 'old' was loaded, cast it to ConditionChecker
	cc, ok := old.(ConditionChecker)
	if !ok {
		return true, fmt.Errorf("failed to cast cc to ConditionChecker for volume-id %q", volumeID)
This error will not be checked, as we are returning the volume as healthy along with the error message.
If there is an error in the internals of the health-checker, it is assumed that the volume is healthy. Only when there really is a problem with the volume is the status reported as unhealthy.
Can we log and also document this one? We might need better logging to check whether the health checker is running and what the latest run's details were (by looking at the logs). Maybe we can do it as an enhancement later.
I've added more docs about it now.
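A minimal, self-contained sketch of that behaviour (illustrative names, not the manager's real types): when the health-checker itself has a problem, for example no checker is registered for the volume, the volume is still reported as healthy and the internal error is returned so it can be logged.

```go
package main

import (
	"fmt"
	"log"
)

var checkers = map[string]bool{} // volumeID -> healthy

// isHealthy treats internal health-checker problems (such as a missing
// checker) as "healthy"; only a real volume failure reports unhealthy.
func isHealthy(volumeID string) (bool, error) {
	healthy, ok := checkers[volumeID]
	if !ok {
		return true, fmt.Errorf("no ConditionChecker for volume-id: %s", volumeID)
	}

	return healthy, nil
}

func main() {
	healthy, err := isHealthy("vol-1")
	if err != nil {
		// the error is logged, but the volume condition stays healthy
		log.Printf("health-checker: %v", err)
	}

	fmt.Println("healthy:", healthy)
}
```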
Lots of manual testing done, including the following combinations: pod uses PVC and …
Thanks Niels, LGTM
LGTM, left a question and a small nit.
	// volume is already staged and published.
	if healthy && msg != nil {
		// Start a StatChecker for the mounted targetPath, yhis prevents
		// writing a file in the user-visible location. Ideally a (shared)
`yhis` to `this`
		// writing a file in the user-visible location. Ideally a (shared)
		// FileChecker is started with the stagingTargetPath, but we can't
		// get the stagingPath from the request.
		err = ns.healthChecker.StartChecker(req.GetVolumeId(), targetPath, hc.StatCheckerType)
Can we use something like the below to get the stagingPath from the targetPath?

ceph-csi/internal/rbd/nodeserver.go, lines 1079 to 1094 in c09700b:
func (ns *NodeServer) getStagingPath(volPath string) (string, error) {
	mounts, err := ns.Mounter.GetMountRefs(volPath)
	if err != nil {
		return "", err
	}
	for _, mount := range mounts {
		// strip the last directory from the staging path
		stp := strings.Split(mount, "/")
		stagingTargetPath := strings.Join(stp[:len(stp)-1], "/")
		if checkRBDImageMetadataStashExists(stagingTargetPath) {
			return stagingTargetPath, nil
		}
	}

	return "", fmt.Errorf("failed to get staging path for volume %s", volPath)
}
Something like this could be an enhancement; I am not sure if it is worth the effort and additional complexity, though.
When a `FileChecker` is used, it would make more sense, as that could detect the existence of the file holding the timestamp.
Adding a TODO comment with a reference to `rbd.getStagingPath()`.
		// writing a file in the user-visible location. Ideally a (shared)
		// FileChecker is started with the stagingTargetPath, but we can't
		// get the stagingPath from the request.
		err = ns.healthChecker.StartChecker(req.GetVolumeId(), targetPath, hc.StatCheckerType)
I also see a very small race for the below case:

- During NodeUnpublish, cephcsi stopped the health checker
- Before the umount operation we got a `NodeGetVolumeStats` RPC call, which in turn started a health checker
- Now we have a stale health checker which will be cleaned up only during NodeUnstage (I think)

Is it possible to have such a case?
I do not expect the CO to request `NodeUnpublish` and then still use the same `publishTargetPath` in a later `NodeGetVolumeStats`. If a CO does that, then yes, there would be a stale health checker (if it was started in a `NodeGetVolumeStats` call, and not in `NodeStageVolume`).
@Madhu-1 I have addressed your latest concerns as well. Please check again, thanks!
Pull request has been modified.
LGTM, Thanks @nixpanic
@Mergifyio rebase
Signed-off-by: Niels de Vos <[email protected]>
Signed-off-by: Niels de Vos <[email protected]>
The HealthChecker is configured to use the Staging path of the volume, with a `.csi/` subdirectory. In the future this directory could be a directory that is not under the Published directory. Fixes: ceph#4219 Signed-off-by: Niels de Vos <[email protected]>
When FilesystemNodeGetVolumeStats() succeeds, the volume must be healthy. This can be included in the VolumeCondition CSI message by default. Checks that detect an abnormal VolumeCondition should prevent calling FilesystemNodeGetVolumeStats() as it is possible that the function will hang. Signed-off-by: Niels de Vos <[email protected]>
✅ Branch has been successfully rebased
@Mergifyio queue
✅ The pull request has been merged automatically at 4d3b1fc
/test ci/centos/upgrade-tests-cephfs
/test ci/centos/upgrade-tests-rbd
/test ci/centos/k8s-e2e-external-storage/1.27
/test ci/centos/k8s-e2e-external-storage/1.26
/test ci/centos/mini-e2e-helm/k8s-1.27
/test ci/centos/k8s-e2e-external-storage/1.28
/test ci/centos/mini-e2e/k8s-1.27
/test ci/centos/mini-e2e-helm/k8s-1.26
/test ci/centos/mini-e2e-helm/k8s-1.28
/test ci/centos/mini-e2e/k8s-1.28
/test ci/centos/mini-e2e/k8s-1.26
When an MDS is slow, the `NodeGetVolumeStats` CSI procedure may hang for a while. In case the MDS does not respond at all (or the reply is lost), the `stat()` syscall may hang indefinitely.

Because Kubelet retries the `NodeGetVolumeStats` CSI procedure regularly, the number of hanging `stat()` syscalls can pile up. Each hanging call will consume a thread (go routine), and this may eventually starve the processing capabilities of the CephFS CSI-nodeplugin container.

The commits in this PR come from #4125, which is planned to have a few more features (like a separate `data/` directory for the contents of the volume).

Manual Testing

- `NodeGetVolumeStats` calls are in the CephFS node-plugin logs
- Scale down the MDS(s), see that the volume becomes unhealthy in the `NodeGetVolumeStats` calls. Results (see `GRPC response`):
- Scale up the MDS(s) again, see that the volume becomes healthy again:
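A minimal, self-contained sketch of the approach described above (illustrative only; the PR uses the CSI response types and `csicommon.FilesystemNodeGetVolumeStats`, not the stand-in `volumeStats` struct used here): the cached health state is consulted first, and the potentially hanging stat-based collection runs only for healthy volumes.

```go
package main

import (
	"errors"
	"fmt"
)

// volumeStats is a stand-in for the CSI NodeGetVolumeStats response.
type volumeStats struct {
	abnormal  bool
	message   string
	bytesUsed int64
}

// getVolumeStats checks the cached health state first and only calls the
// (potentially hanging) stat-based collection when the volume is healthy.
func getVolumeStats(healthy bool, healthErr error, collect func() (int64, error)) (*volumeStats, error) {
	if !healthy {
		// report the abnormal condition without touching the filesystem
		return &volumeStats{abnormal: true, message: healthErr.Error()}, nil
	}

	used, err := collect()
	if err != nil {
		return nil, err
	}

	return &volumeStats{message: "volume is in a healthy condition", bytesUsed: used}, nil
}

func main() {
	stats, err := getVolumeStats(false, errors.New("health-check has not responded for 30 seconds"), nil)
	if err != nil {
		panic(err)
	}

	fmt.Printf("abnormal=%v message=%q\n", stats.abnormal, stats.message)
}
```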
Show available bot commands

These commands are normally not required, but in case of issues, leave any of the following bot commands in an otherwise empty comment in this PR:

- `/retest ci/centos/<job-name>`: retest the `<job-name>` after an unrelated failure (please report the failure too!)