Don't track attachments of NFS volumes #937

Open
wants to merge 1 commit into master

Conversation

timebertt
Member

/kind bug

What this PR does / why we need it:

This PR filters out NFS volumes when draining nodes, so that MCM doesn't try to track their detachment and re-attachment.
There is no such thing as an attachment to a node for NFS volumes. NFS volumes only need to be mounted (by kubelet).

Accordingly, MCM always runs into the configured PV reattach timeout if there are pods with NFS-backed PVCs on the node to be drained.
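
For illustration, a minimal sketch of the filtering described above, assuming a hypothetical helper (the package and function names are placeholders, not the actual PR diff): NFS-backed PVs are dropped from the set of volumes whose detachment is tracked during drain.

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// filterAttachmentTrackedPVs returns only the PVs whose detachment should be
// waited for during drain. NFS volumes are excluded: kubelet only mounts them,
// so no VolumeAttachment object ever exists for them.
func filterAttachmentTrackedPVs(pvs []*corev1.PersistentVolume) []*corev1.PersistentVolume {
	tracked := make([]*corev1.PersistentVolume, 0, len(pvs))
	for _, pv := range pvs {
		// pv.Spec.NFS is set only for in-tree NFS volume sources.
		if pv.Spec.NFS != nil {
			continue
		}
		tracked = append(tracked, pv)
	}
	return tracked
}
```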

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

cc @xoxys

Release note:

A bug has been fixed for draining nodes with mounted NFS volumes. With this fix, machine-controller-manager no longer tries to track their (non-existent) VolumeAttachments.

@timebertt requested a review from a team as a code owner on August 23, 2024, 10:31
@gardener-robot

@timebertt Thank you for your contribution.

@gardener-robot-ci-3
Contributor

Thank you @timebertt for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

@gardener-robot added the size/xs label (Size of pull request is tiny; see gardener-robot robot/bots/size.py) on Aug 23, 2024
@timebertt
Member Author

We probably need to add tests for this PR. However, we wanted to open it to get some feedback on whether this is the correct approach/place to solve the issue.

@kon-angelo
Contributor

@timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false? (Not necessarily in this PR.)

@timebertt
Member Author

@timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false? (Not necessarily in this PR.)

Sounds good. I had the impression that the attachment tracking doesn't work at all if there are no VolumeAttachment objects, i.e., when no CSI driver is used; the operation then just runs into the configured timeout. Is this correct?
If so, we should probably change this PR from "skip tracking NFS volumes" to "skip tracking non-CSI volumes".

Another PR could then enhance the check for CSI drivers where no VolumeAttachments are used.
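
For reference, the "skip tracking non-CSI volumes" variant discussed above could hinge on a check roughly like the following sketch (the helper name is hypothetical, not taken from the MCM code):

```go
package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// isCSIVolume reports whether the PV is backed by a CSI driver. Only CSI
// volumes get VolumeAttachment objects, so only these would be tracked.
func isCSIVolume(pv *corev1.PersistentVolume) bool {
	// pv.Spec.CSI is set only for CSI-provisioned/handled volumes.
	return pv.Spec.CSI != nil
}
```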

@timebertt
Member Author

@kon-angelo @gardener/mcm-maintainers can you provide feedback on the PR and the above discussion?

@unmarshall
Contributor

unmarshall commented Sep 3, 2024

@timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with attachRequired == false? (Not necessarily in this PR.)

I agree with @kon-angelo's observation that this can be handled generically. @timebertt, since you have raised this PR, we can also do this in two steps: skip NFS volumes (as you have done in this PR) and generalise it later. The generalisation would require an additional lookup of the CSIDriver object and its AttachRequired property to decide whether the wait for detachment of such a PV can be skipped.
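
For the second step, the extra CSIDriver lookup mentioned above could look roughly like this sketch using client-go (the helper name and its placement are hypothetical; attachRequired is documented to default to true when unset):

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// attachRequired reports whether the CSI driver backing the given PV requires
// an attach operation, i.e. whether a VolumeAttachment is expected at all.
func attachRequired(ctx context.Context, client kubernetes.Interface, pv *corev1.PersistentVolume) (bool, error) {
	if pv.Spec.CSI == nil {
		// In-tree volume (e.g. NFS): no VolumeAttachment is ever created.
		return false, nil
	}
	driver, err := client.StorageV1().CSIDrivers().Get(ctx, pv.Spec.CSI.Driver, metav1.GetOptions{})
	if err != nil {
		if apierrors.IsNotFound(err) {
			// Without a CSIDriver object we cannot tell; fall back to tracking.
			return true, nil
		}
		return false, err
	}
	// attachRequired defaults to true when unset.
	return driver.Spec.AttachRequired == nil || *driver.Spec.AttachRequired, nil
}
```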

@kon-angelo
Contributor

kon-angelo commented Sep 3, 2024

Aside from my previous point, I kind of understand @timebertt's suggestion. The MCM code relies on VolumeAttachments, which, to my knowledge, are only created for CSI volumes. We could probably update the logic in this PR to skip anything other than CSI volumes. (My experience is just from going through the code, so this should be verified by one of the actual MCM maintainers.)

One point, though: I am not particularly fond of "skipping" the reporting of these volumes' existence rather than simply not accounting for them when draining, if that makes sense. I find this can be somewhat troublesome to debug. But the function doing the actual work, evictPodsWithPVInternal, which also uses getVolIDsFromDriver, is used in many places, so I understand this implementation.

I just feel that at some point we will go down a debugging rabbit hole, because getVolIDsFromDriver now does a little more than its name suggests, namely the skipping part 🤷

Labels
kind/bug (Bug) · needs/review (Needs review) · size/xs (Size of pull request is tiny; see gardener-robot robot/bots/size.py)
5 participants