Distinguish between different VDDK validation errors #969
Conversation
thanks @jonner for taking that issue and posting this fix!
I just realized that this is probably not a full fix. Example scenario:
The new patch I just pushed does not yet address the issue I discussed in this comment: #969 (comment)
@jonner yeah, and there's an even more basic scenario that we don't handle so well -
Honestly, we were not really sure whether it made sense to add this job back then, because the only VDDK image problem we had heard of was someone creating the image on a filesystem that doesn't support symbolic links. But apparently this job actually detects issues in the field, and in that case it makes less sense to expect users to clone the plan, or archive and recreate it, in order to initiate another job. Specifically, about what you wrote above,
pkg/controller/plan/validation.go
Outdated
pods := &core.PodList{}
if err = ctx.Destination.Client.List(context.TODO(), pods, &client.ListOptions{
	Namespace:     plan.Spec.TargetNamespace,
	LabelSelector: k8slabels.SelectorFromSet(map[string]string{"job-name": job.Name}),
Cool, now this call would work in all scenarios, but I think it should be placed elsewhere: in many reconcile cycles we'll find the job in JobComplete and won't need the information from the pod.
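To illustrate that suggestion, here is a minimal sketch (not the actual forklift code) of how the pod lookup could be gated on the job's conditions, so that jobs already in JobComplete skip the extra List call. It assumes a controller-runtime client; the helper names are made up for this example.

```go
package validation

import (
	"context"

	batch "k8s.io/api/batch/v1"
	core "k8s.io/api/core/v1"
	k8slabels "k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// jobConditionTrue reports whether the given condition type is set to True on the job.
func jobConditionTrue(job *batch.Job, conditionType batch.JobConditionType) bool {
	for _, cond := range job.Status.Conditions {
		if cond.Type == conditionType && cond.Status == core.ConditionTrue {
			return true
		}
	}
	return false
}

// podsForFailedJob lists the pods belonging to the job, but only when the job
// has actually failed; completed or still-running jobs don't need the pod information.
func podsForFailedJob(ctx context.Context, cl client.Client, job *batch.Job) (*core.PodList, error) {
	pods := &core.PodList{}
	if jobConditionTrue(job, batch.JobComplete) {
		// Nothing to inspect: the validation succeeded.
		return pods, nil
	}
	if !jobConditionTrue(job, batch.JobFailed) {
		// Still running; no verdict yet.
		return pods, nil
	}
	err := cl.List(ctx, pods, &client.ListOptions{
		Namespace:     job.Namespace,
		LabelSelector: k8slabels.SelectorFromSet(map[string]string{"job-name": job.Name}),
	})
	return pods, err
}
```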
Codecov Report

@@           Coverage Diff            @@
##             main     #969     +/-  ##
=========================================
  Coverage   16.36%   16.36%
=========================================
  Files         109      109
  Lines       19628    19682     +54
=========================================
+ Hits         3212     3221      +9
- Misses      16135    16178     +43
- Partials      281      283      +2
It's true that the more basic scenario has those issues, but at least it was consistent before: It failed with
Well, you know the code better than I do, but in my testing, the pod related to the job doesn't seem to be retained after the job completes. Specifically, after the job completes, the call to list the job's pods returns nothing.
yeah, I agree
Unless I'm missing something, I read the changes in this PR as introducing logic that says "once the job completes, we check the status of the pod that was triggered for the job to figure out the reason for a failure", so it assumes the pod also exists when the job completes with a failure, right?
Not quite. While the job is running, its pod can be found by listing pods with the job-name label. At some point, the image will either be pulled (and the pod will start running), or the pod will time out waiting for the image. In the first scenario, the job should run to completion and either succeed or fail based on the validation result. In the second scenario, the job will fail. Immediately after the job finishes (either successfully or due to timing out), the next call to list the job's pods still returns them.

But the above description does not apply to a cached job. In my experience, no pods are returned for a cached job, so there's no way to determine the failure reason in this case. But maybe I am simply doing something wrong and there is a way to get pods for a completed job that I don't know about?
@jonner I've been investigating the issue with pod deletion, and it turns out that it happens only when the pod is not fully created. If the VDDK image is invalid (the pod is unable to pull the image) and the job reaches its active deadline, the pod is removed.

After discussing this with @ahadas, we believe the best approach is to remove the active deadline from the validation job. What do you think of this solution?
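As a minimal sketch of what removing the job's deadline could look like on the controller side, the reconcile loop could track elapsed time from the job's creation timestamp instead. The timeout value and function name below are illustrative; the condition names are the ones introduced later in this PR.

```go
package validation

import (
	"time"

	batch "k8s.io/api/batch/v1"
)

// Illustrative timeout; not the real setting used by the controller.
const vddkImageDeadline = 5 * time.Minute

// Condition names taken from the discussion in this PR.
const (
	VDDKInitImageNotReady    = "VDDKInitImageNotReady"    // advisory: still waiting to pull
	VDDKInitImageUnavailable = "VDDKInitImageUnavailable" // critical: waited past the deadline
)

// vddkImageCondition decides which condition to report while the validation
// job has neither completed nor failed.
func vddkImageCondition(job *batch.Job, now time.Time) string {
	elapsed := now.Sub(job.CreationTimestamp.Time)
	if elapsed > vddkImageDeadline {
		return VDDKInitImageUnavailable
	}
	return VDDKInitImageNotReady
}
```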
Interesting idea. There are some details of your proposed solution that aren't yet 100% clear in my mind, but I will investigate. It sounds promising.
A validation job is labeled with both the UID of the plan as well as the md5sum of the vddk init image URL. So with the above approach the following statement is not actually true:
When the vddk url is updated for the source provider, the next reconcile will look up a Job associated with the given plan and the new vddk URL, and will find none. So the old job will continue running (trying to pull the old init image), but we will ignore it. Instead, we'll create a new Job for the new plan/URL combination and run that. So I think it'll require some slightly more complicated job management to handle this situation.
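For illustration, here is a sketch of the kind of job management described here: when the plan's VDDK URL changes, find jobs that carry the plan's UID but an old URL hash, and delete them. Only the plan-UID/md5 labeling scheme comes from the comment above; the label keys and function name are hypothetical.

```go
package validation

import (
	"context"
	"crypto/md5"
	"fmt"

	batch "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	k8slabels "k8s.io/apimachinery/pkg/labels"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical label keys; the real controller may use different ones.
const (
	planUIDLabel  = "plan-uid"
	vddkHashLabel = "vddk-md5"
)

// cancelStaleVDDKJobs deletes validation jobs that belong to the plan but were
// created for a different (old) VDDK init image URL.
func cancelStaleVDDKJobs(ctx context.Context, cl client.Client, namespace, planUID, vddkURL string) error {
	currentHash := fmt.Sprintf("%x", md5.Sum([]byte(vddkURL)))

	jobs := &batch.JobList{}
	err := cl.List(ctx, jobs, &client.ListOptions{
		Namespace:     namespace,
		LabelSelector: k8slabels.SelectorFromSet(map[string]string{planUIDLabel: planUID}),
	})
	if err != nil {
		return err
	}
	for i := range jobs.Items {
		job := &jobs.Items[i]
		if job.Labels[vddkHashLabel] == currentHash {
			continue // still relevant for the current URL
		}
		// Foreground propagation also removes the job's pods.
		propagation := metav1.DeletePropagationForeground
		if err := cl.Delete(ctx, job, &client.DeleteOptions{PropagationPolicy: &propagation}); err != nil {
			return err
		}
	}
	return nil
}
```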
Force-pushed from a79d59f to 0570a4d
Several functions accept a vddk image argument even though the vddk image can be retrieved directly from the plan. Signed-off-by: Jonathon Jongsma <[email protected]>
Signed-off-by: Jonathon Jongsma <[email protected]>
This code previously returned nil if the source and destination providers were not set for the plan when validating the vddk image, but it seems to make more sense to return an error instead. Signed-off-by: Jonathon Jongsma <[email protected]>
Rather than passing the labels to the function, just query it using the utility function. Signed-off-by: Jonathon Jongsma <[email protected]>
If the vddk validator pod fails, we don't need to keep re-trying. The container simply checks for the existence of a file, so restarting the pod is unlikely to change anything. In addition, by specifying `Never` for the restart policy, the completed pod should be retained for examination after the job fails, which can be helpful for determining the cause of failure. Signed-off-by: Jonathon Jongsma <[email protected]>
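As a rough sketch of what that commit describes, the relevant parts of the job spec might look like the following. The container image, the checked file path, and the helper name are placeholders, not the actual forklift values.

```go
package validation

import (
	batch "k8s.io/api/batch/v1"
	core "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newVDDKValidationJob sketches the relevant parts of the validation job spec:
// the pod is never restarted and no replacement pods are created, so a failed
// pod remains available for inspection.
func newVDDKValidationJob(namespace, vddkImage string) *batch.Job {
	backoffLimit := int32(0)
	return &batch.Job{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "vddk-validator-",
			Namespace:    namespace,
		},
		Spec: batch.JobSpec{
			BackoffLimit: &backoffLimit, // no replacement pods after a failure
			// Note: no ActiveDeadlineSeconds; the controller tracks the timeout itself.
			Template: core.PodTemplateSpec{
				Spec: core.PodSpec{
					RestartPolicy: core.RestartPolicyNever,
					InitContainers: []core.Container{{
						Name:  "vddk-side-car",
						Image: vddkImage, // the image being validated
					}},
					Containers: []core.Container{{
						Name:    "validator",
						Image:   "registry.access.redhat.com/ubi9/ubi-minimal",                     // placeholder image
						Command: []string{"/bin/sh", "-c", "test -f /vddk/lib64/libvixDiskLib.so"}, // placeholder path
					}},
				},
			},
		},
	}
}
```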
Force-pushed from 0570a4d to 9d70a89
So, I took a slightly different approach than mentioned in my last comment. In the patch series that I just pushed, I did in fact remove the active deadline for the job, and changed a few additional things. I think it handles the scenarios that were discussed above. See the commit log for the "Distinguish between different VDDK validation errors" patch for more details.
Force-pushed from 9d70a89 to 4791fc9
Codecov Report

@@           Coverage Diff            @@
##            main     #969     +/-  ##
========================================
  Coverage       ?   16.27%
========================================
  Files          ?      112
  Lines          ?    19858
  Branches       ?        0
========================================
  Hits           ?     3232
  Misses         ?    16339
  Partials       ?      287
There are multiple cases that can lead to a "VDDK Init image is invalid" error message for a migration plan. They are currently handled with a single VDDKInvalid condition. One of the most common is when the vddk image cannot be pulled (either due to network issues or due to the user typing an incorrect image URL). Categorizing this type of error as an "invalid VDDK image" is confusing to the user.

When the initContainer cannot pull the VDDK init image, the vddk-validator-* pod has something like the following status:

    initContainerStatuses:
      - name: vddk-side-car
        state:
          waiting:
            reason: ErrImagePull
            message: 'reading manifest 8.0.3.14 in default-route-openshift-image-registry.apps-crc.testing/openshift/vddk: manifest unknown'
        lastState: {}
        ready: false
        restartCount: 0
        image: 'default-route-openshift-image-registry.apps-crc.testing/openshift/vddk:8.0.3.14'
        imageID: ''
        started: false

We can use the existence of the 'waiting' state on the pod to indicate that the image cannot be pulled. Unfortunately, the validation job's pods are deleted when the job fails due to a failure to pull the image. Because of this, there's no way to examine the pod status to see why the failure occurred after the deadline. So this patch removes the deadline from the validation job, which requires overhauling the validation logic slightly. We add a new advisory condition `VDDKInitImageNotReady` to indicate that we are still waiting to pull the VDDK init image, and a new critical condition `VDDKInitImageUnavailable` to indicate that the condition has persisted for longer than the active deadline setting.

Since the job will now retry pulling the vddk image indefinitely (due to the removal of the job deadline), we need to make sure that orphaned jobs don't run forever. So when the vddk image for a plan changes, we need to cancel all active validation jobs that are still running for the old vddk image.

This overall approach has several advantages:
- The user gets an early indication (via `VDDKInitImageNotReady`) that the image can't be pulled
- The validation will automatically complete when any network interruption is resolved, without needing to delete and re-create the plan to start a new validation
- The validation will no longer report a VDDKInvalid error when the image pull is very slow due to network issues, because there is no longer a deadline for the job

Resolves: https://issues.redhat.com/browse/MTV-1150
Signed-off-by: Jonathon Jongsma <[email protected]>
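For reference, a small sketch of how the 'waiting' init-container state described in the commit message could be detected; the function name is illustrative and this is not the exact code in the patch.

```go
package validation

import (
	core "k8s.io/api/core/v1"
)

// imagePullBlocked reports whether the validator pod is stuck waiting on its
// init container image, along with the reason reported by the kubelet (for
// example ErrImagePull or ImagePullBackOff).
func imagePullBlocked(pod *core.Pod) (blocked bool, reason string) {
	for _, status := range pod.Status.InitContainerStatuses {
		if status.State.Waiting != nil {
			return true, status.State.Waiting.Reason
		}
	}
	return false, ""
}
```

In the controller, a true result would back the advisory `VDDKInitImageNotReady` condition while the pull is still being retried, and the critical `VDDKInitImageUnavailable` condition once the wait exceeds the configured deadline.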
Force-pushed from 4791fc9 to 56de205
Drastic improvements to the VDDK validations!
Thank you so much for your help, LGTM!
There are multiple cases that can lead to a "VDDK Init image is invalid" error message for a migration plan. They are currently handled with a single VDDKInvalid condition.
This patch adds a new error condition VDDKImageNotFound (and a new advisory condition VDDKImageNotReady) to help diagnose an issue of a missing image or incorrect image url.
Resolves: https://issues.redhat.com/browse/MTV-1150