Policy fails with "execution deadline exceeded" #73
Hi @mueller-ma, thanks for the report! We have this protection in place to prevent policies from running indefinitely. You can find more information about this configuration, including how to increase the timeout or disable it, in the documentation.

At first glance, I couldn't see anything obvious in the code that would explain the timeout. The code is quite simple: it iterates over the environment variables in the containers and runs a scanner on each value.

From your report, I understand that all other policies are running fine, even during the same period when this policy fails. Is that correct? If so, I would (at least for now) consider it less likely that a lack of resources is the problem.

In the bug description, you mentioned that the policy fails for some entities. Are these always the same entities? I'm wondering if some entity has a very large value, which could cause the policy to take longer to decode the value and/or the scanner to take longer to check the input.

I think the first step would be to increase (or temporarily disable) the timeout and see if the failures stop. Please also check if any environment variable has a long value that might explain the parsing and verification time. Keep in mind that values are verified twice: once with the original value and again after decoding it from base64. Furthermore, in both cases, the value is split by `|`.
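The double check described above (each value scanned as-is and again after base64 decoding, with both forms split on `|`) can be sketched in Python. This is only a sketch: the real policy is a Rust/WebAssembly module, and the scanner regex and function names below are hypothetical stand-ins.

```python
import base64
import re

# Hypothetical stand-in for the real secret scanner used by the policy.
SECRET_PATTERN = re.compile(r"(?i)(api[_-]?key|secret|password)\s*[:=]")

def scan(fragment: str) -> bool:
    """Return True if the fragment looks like a leaked secret."""
    return bool(SECRET_PATTERN.search(fragment))

def check_env_value(value: str) -> bool:
    """Mirror the policy's double check: the raw value and, when it decodes
    cleanly, its base64-decoded form are each split on '|' and every
    fragment is scanned."""
    candidates = [value]
    try:
        candidates.append(base64.b64decode(value, validate=True).decode("utf-8"))
    except Exception:
        pass  # not valid base64 (or not UTF-8): only the raw value is scanned
    return any(scan(part) for cand in candidates for part in cand.split("|"))
```

Because every value is processed twice, fragment by fragment, a single oversized environment variable can noticeably stretch the evaluation time, which is why long values were worth ruling out here.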
Thanks for your feedback. I'll keep the timeout as-is for now to see which entities make the check fail, and whether they are always the same or have a lot of env variables.
Sounds good to me. Let us know what you find! :)
Right now only this policy failed, and only for one pod. The pod is part of a ReplicaSet.

```
$ kubectl get pods/<pod> -o jsonpath="{.spec.containers[*].env}" | wc -c
175
```

Also: when checking the ReplicaSet, maybe skip checking the pods of the ReplicaSet?
The policy failed on another cluster for a deployment and a pod of that deployment. Both have zero env vars set.
Okay, I'll investigate the issue further. But to answer your question:
You do not need to watch the pods directly. You can watch the high-level resources, and the policy will check the pods belonging to them. There is an open issue to address what you just asked, but we don't have a final solution for it yet. For now, the policies have to handle that themselves.
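That advice can be sketched as a ClusterAdmissionPolicy whose rules target only the high-level resources. This is an illustrative sketch: the policy name, module URL, and tag below are assumptions, not taken from this thread.

```yaml
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: env-variable-secrets   # illustrative name
spec:
  # Hypothetical module reference; use the policy's real registry URL and tag.
  module: registry://ghcr.io/kubewarden/policies/env-variable-secrets-scanner:latest
  mutating: false
  rules:
    # Watch the high-level resources; the policy derives the pod specs from them.
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      resources: ["deployments", "replicasets"]
      operations: ["CREATE", "UPDATE"]
```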
@mueller-ma Now that environment variables don't seem to be the problem, I'm considering a scenario where resources are too low. Therefore, I'm trying to simulate this issue locally by overloading the policy server with requests. However, it's often convoluted to simulate these kinds of scenarios. Thus, I have some suggestions to move this forward.

In the issue description, you mentioned that the policy reports show errors from the policy server. This indicates that the error occurs while the audit scanner is running. This makes sense considering the scenario of too low resources, as the audit scanner will send many requests to the policy server to evaluate the currently running resources.

I'm assuming you have a large number of resources under evaluation. If you could share this information, that would be helpful. Then, we can get an idea of the load the audit scanner is generating.

The audit scanner has configurations to control how many requests it can send to the policy server. Could you share your current values for these configurations? Perhaps tuning these values could improve the audit scanner's performance in your environment.

Also, I suggest you enable telemetry and see if you can spot anything unusual happening. In the metrics, you can look for other policies failing and evaluation times. This will help us identify potential problems. Specifically, try to find peaks in evaluation times. Are they all generated by the audit scanner?

NOTE: If your organization has restrictions on sharing certain data about your environment, you can reach out to us on Slack to discuss it privately. We've done this in the past and are happy to gather that information from you. ;)
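As an illustration of the tuning mentioned above, the idea is to lower the audit scanner's request parallelism. All key and flag names in this sketch are assumptions; verify them against your chart version and the audit-scanner documentation before use.

```yaml
# Hypothetical helm values sketch for the kubewarden-controller chart;
# exact key and flag names vary by version -- check your chart's values.yaml.
auditScanner:
  extraArgs:
    - --parallel-resources=10   # assumed flag name: fewer concurrent resource evaluations
    - --parallel-policies=2     # assumed flag name: fewer concurrent policy requests
```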
Actually, it's quite the opposite. The issue from #73 (comment) happened on a cluster where only two namespaces are monitored by Kubewarden policies.
These are the values for
There are 17 policies on the server:
Ok, I think I'm now able to simulate the issue. I've tested with a Minikube cluster configured with 2 CPUs and 2GB of memory. In this cluster, I've installed 14 policies: all the recommended policies, some safe-labels and safe-annotations, and 5 instances of the
I'm observing errors similar to this when the audit scanner starts running:
First, can you confirm that this is a similar error to what you're seeing in your environment? Does the stack trace show a similar path? My hypothesis is that the

Meanwhile, I'll take this opportunity to update all the dependencies of this policy, as they are out of date. I don't believe this will make a significant difference for this issue, but I'll also examine the stack traces to see if we can make any improvements.
How can I see that stack trace? Right now I was only looking at the policy reporter UI. I reduced the
I think this issue has been closed by accident. Reopening it. |
In the policy server logs. |
I'm taking this issue as an opportunity to also update the policy and release the dependency updates.
@mueller-ma Do you have any feedback on whether the reduction of the
There was no issue after I reduced the
That's great! Thanks for your patience and help testing everything. :)
Is there an existing issue for this?
Current Behavior
I use this policy as a `ClusterAdmissionPolicy` and it regularly fails for some entities. The policy reporter dashboard shows the error "internal server error: Guest call failure: guest code interrupted, execution deadline exceeded". How can I debug this issue further? Maybe there's a bug in the policy, or just too few resources or too short a timeout.
Expected Behavior
Don't fail.
Steps To Reproduce
No response
Environment
Anything else?
A few hours after the issue appears, the policy will succeed again.