FailedAttachVolume warning when moving between workflow steps with Kubernetes backend #3869
I can replicate it using this workflow:

```yaml
skip_clone: true

steps:
  one:
    image: alpine
    commands:
      - echo One
  two:
    image: alpine
    commands:
      - echo Two
  three:
    image: alpine
    commands:
      - echo Three
```
^ There is no clone step.

Is the "bug" gone if you schedule the pods on the same node?

```yaml
skip_clone: true

steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node>
  two:
    image: alpine
    commands:
      - echo Two
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/hostname: <name of node, equal to the value in the first step>
```
BTW, see woodpecker/pipeline/pipeline.go, lines 228 to 285 at commit e5f3e67.
What is the difference between the "Current behavior" and "Proposed behavior" examples in the time aspect?
Thank you for the suggestions! I was able to confirm that the warning is gone when the pods are scheduled on the same node.

The bug also disappeared on my real multi-step workflow when I hard-coded the specific node id. In our cluster, though, all of the nodes auto-scale, so a particular node id is not guaranteed to exist at any one time. Most of our workflows have all of their steps set to target a node where the only requirements are that some resources are available and that the node carries a certain label.

On the point of current and proposed behavior: this is likely due to some ignorance on my part about how k8s volumes attach and detach, but I was assuming that the volume detaches much sooner than 15 seconds, and that, because the attach had actively failed, a backoff of 15 seconds was applied before a retry was made. That would make sense for a typical k8s microservice scenario, where volumes rarely move between nodes. So my belief of current versus proposed behavior (which I am not sure how I would validate) is:

Current behavior: the new step's pod tries to attach the volume while it is still attached to the previous node, the attach fails, and Kubernetes waits ~15 seconds before retrying.

Proposed behavior: the volume is detached (or the attach is retried faster) as soon as the previous step finishes, so the new pod starts almost immediately.
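For context, here is a sketch of the kind of per-step options described above; the `node-pool` label name and the resource values are hypothetical stand-ins, not taken from the issue:

```yaml
steps:
  one:
    image: alpine
    commands:
      - echo One
    backend_options:
      kubernetes:
        resources:
          requests:
            memory: 256Mi   # hypothetical values: "some resources are available"
            cpu: 250m
        nodeSelector:
          node-pool: ci   # hypothetical label shared by all auto-scaling CI nodes
```

Any node in the auto-scaling pool satisfies such a selector, which is exactly why two consecutive steps can land on different nodes and trigger the volume move.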
(The issue was later retitled from "FailedAttachVolume error when moving between workflow steps with Kubernetes backend" to "FailedAttachVolume warning when moving between workflow steps with Kubernetes backend".)
This should be validated.

Also, I'm not sure that the 500 ms / 15 s (for example) is detaching time alone. It could be 10 ms for detaching and 490 ms for attaching, i.e. migrating the data between nodes.

There was an attempt.
I've reread #3345... It was implemented for a narrower case. However, nothing stops us from widening this approach: save the node where the first pod ran, then add affinity to that node for the subsequent pods. Like #3345 (comment), but taking the info from the first pod. Moreover, we do not need to query the pod for that info separately, since we are already listening to pod updates.

Or we could stick the pods to a certain node the way K3s' local-path provisioner does: Persistent Volume Node Affinity.

That said, this requires workflow-level backend options to be implemented first if we want it to be adjustable per workflow. Considering that it is not an error but a warning, I would qualify this not as a bug.
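To make the two ideas concrete, here is a rough sketch (hypothetical manifests, not code from Woodpecker; the node name, pod name, and paths are made up). The first document pins a later step's pod to the node recorded from the first step's pod; the second shows the node affinity a local-path-style PersistentVolume carries, which forces every pod that mounts it onto that node:

```yaml
# Idea 1 (hypothetical): inject node affinity into the second step's pod,
# using the node name observed on the first step's pod.
apiVersion: v1
kind: Pod
metadata:
  name: wp-step-two
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - node-a1   # recorded when the first step's pod was scheduled
  containers:
    - name: step
      image: alpine
      command: ["echo", "Two"]
---
# Idea 2 (hypothetical): a local-path style PersistentVolume with node
# affinity; the scheduler then keeps all consumers on the same node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: wp-workspace
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  hostPath:
    path: /opt/local-path-provisioner/wp-workspace
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-a1
```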
Component
agent
Describe the bug
When I observe a workflow moving from one step to another, there is always a `FailedAttachVolume` event. This event is resolved 15 seconds later with a `SuccessfulAttachVolume`.

Steps to reproduce
Expected behavior
When moving between workflow steps, the woodpecker agent should gracefully transition the persistent volume between pods. If such a graceful transition is not possible, a retry faster than 15 seconds would make sense to speed up build times. Currently, for a typical 5-minute, 5-step build, about 20% of the total execution time is spent waiting for this error to resolve (four step transitions at ~15 seconds each is roughly 60 seconds of a 300-second build).
System Info
Additional context
Here are the typical event logs from the pod of a second step in a run.
I am not very experienced with Kubernetes, and it is entirely possible that this is not a problem with woodpecker itself. From the pipeline.go code referenced above, it appears that the triggering event for moving from one step to the next is straightforward. But perhaps that could be altered to check that the PV has been detached before continuing?