
Policy fails with "execution deadline exceeded" #73

Closed
1 task done
mueller-ma opened this issue Feb 7, 2025 · 16 comments
Labels
kind/bug Something isn't working

Comments

@mueller-ma

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I use this policy as a ClusterAdmissionPolicy and it regularly fails for some entities. The policy reporter dashboard shows the error "internal server error: Guest call failure: guest code interrupted, execution deadline exceeded". How can I debug this issue further? Maybe there's a bug in the policy, or maybe there are simply too few resources or too short a timeout.

Expected Behavior

The policy should not fail.

Steps To Reproduce

No response

Environment

- OS: Linux
- Architecture: amd64
- Kubernetes: RKE2 v1.29.12+rke2r1
- Policy: ghcr.io/kubewarden/policies/env-variable-secrets-scanner:v0.1.7
- Kubewarden:
$ helm list
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
kubewarden-controller   kubewarden      20              2025-01-21 16:47:46.573115115 +0100 CET deployed        kubewarden-controller-4.0.1     v1.20.0
kubewarden-crds         kubewarden      6               2025-01-07 11:34:31.870156894 +0100 CET deployed        kubewarden-crds-1.12.1          v1.20.0
kubewarden-defaults     kubewarden      9               2025-01-07 11:35:56.220563119 +0100 CET deployed        kubewarden-defaults-2.7.1       v1.20.0

Anything else?

A few hours after the issue appears, the policy will succeed again.

@jvanz jvanz added the kind/bug Something isn't working label Feb 11, 2025
@jvanz jvanz added this to Kubewarden Feb 11, 2025
@jvanz jvanz moved this to Todo in Kubewarden Feb 11, 2025
@jvanz
Member

jvanz commented Feb 11, 2025

Hi @mueller-ma, thanks for the report!

We have this protection in place to prevent policies from running indefinitely. You can find more information about this configuration, including how to increase the timeout or disable it, in the documentation.

At first glance, I couldn't see anything obvious in the code that would explain the timeout. The code is quite simple: it iterates over the environment variables in the containers and runs a scanner on each value.

From your report, I understand that all other policies are running fine, even during the same period when this policy fails. Is that correct? If so, I would (at least for now) consider it less likely that a lack of resources is the problem.

In the bug description, you mentioned that the policy fails for some entities. Are these always the same entities? I'm wondering if some entity has a very large value, which could cause the policy to take longer to decode the value and/or the scanner to take longer to check the input.

I think the first step would be to increase (or temporarily disable) the timeout and see if the failures stop. Please also check if any environment variable has a long value that might explain the parsing and verification time. Keep in mind that values are verified twice: once with the original value and again after decoding it from base64. Furthermore, in both cases, the value is split by \n. Therefore, there's a scanner call for each line in the input.

@jvanz jvanz moved this from Todo to Pending review in Kubewarden Feb 11, 2025
@jvanz jvanz self-assigned this Feb 11, 2025
@jvanz jvanz moved this from Pending review to Blocked in Kubewarden Feb 11, 2025
@mueller-ma
Author

Thanks for your feedback. I'll keep the timeout as-is for now to see which entities make the check fail, and whether they are always the same ones or have many env variables.

@jvanz
Member

jvanz commented Feb 13, 2025

Thanks for your feedback. I'll keep the timeout as-is for now to see which entities make the check fail, and whether they are always the same ones or have many env variables.

Sounds good to me. Let us know what you find! :)

@mueller-ma
Author

mueller-ma commented Feb 18, 2025

Right now only this policy failed, and only for one pod. The pod is part of a ReplicaSet, and for the other pod in that ReplicaSet the policy succeeded.
The pod has 3 env vars, with the longest value being 20 characters, so it shouldn't fail because of too many or too long env vars.

$ kubectl get pods/<pod> -o jsonpath="{.spec.containers[*].env}"  | wc -c
175

Also: When checking the ReplicaSet, maybe skip checking the pods of the ReplicaSet?

@viccuad viccuad moved this from Blocked to Todo in Kubewarden Feb 18, 2025
@mueller-ma
Author

The policy failed on another cluster for a deployment and a pod of that deployment. Both have zero env vars set:

$ kubectl get pods <pod> -o jsonpath="{.spec.containers[*].env}"
$

@jvanz jvanz moved this from Todo to In Progress in Kubewarden Feb 25, 2025
@jvanz jvanz added this to the 1.23 milestone Feb 25, 2025
@jvanz
Member

jvanz commented Feb 25, 2025

Okay, I'll investigate the issue further. But to answer your question:

Also: When checking the ReplicaSet, maybe skip checking the pods of the ReplicaSet?

You do not need to watch the pods directly. You can watch the high-level resources, and the policy will check the pods defined in them.

There is an open issue to address what you just asked, but we do not have a final solution for it yet. For now, the policies have to handle this themselves.
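For illustration, here is a minimal sketch (hypothetical, not taken from your cluster) of a ClusterAdmissionPolicy whose rules target only the higher-level workload resources; the policy then inspects the pod template embedded in them. Field names follow the policies.kubewarden.io/v1 CRD; adjust the module tag to whatever you actually deploy.

# Hypothetical sketch: the rules list only high-level workload resources,
# not pods; the policy evaluates the pod spec embedded in them.
apiVersion: policies.kubewarden.io/v1
kind: ClusterAdmissionPolicy
metadata:
  name: env-variable-secrets-scanner
spec:
  module: registry://ghcr.io/kubewarden/policies/env-variable-secrets-scanner:v0.1.7
  mutating: false
  backgroundAudit: true
  rules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
      operations: ["CREATE", "UPDATE"]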

@jvanz
Member

jvanz commented Feb 26, 2025

@mueller-ma Now that environment variables don't seem to be the problem, I'm considering a scenario where resources are too low. Therefore, I'm trying to simulate this issue locally by overloading the policy server with requests. However, it's often convoluted to simulate these kinds of scenarios. Thus, I have some suggestions to move this forward.

In the issue description, you mentioned that the policy reports show errors from the policy server. This indicates that the error occurs while the audit scanner is running. This makes sense considering the scenario of too low resources, as the audit scanner will send many requests to the policy server to evaluate the currently running resources. I'm assuming you have a large number of resources under evaluation. If you could share this information, that would be helpful. Then, we can get an idea of the load the audit scanner is generating.

The audit scanner has configurations to control how many requests it can send to the policy server. Could you share your current values for these configurations? Perhaps tuning these values could improve the audit scanner's performance in your environment.

Also, I suggest you enable telemetry and see if you can spot anything unusual happening. In the metrics, you can look for other policies failing and evaluation times. This will help us identify potential problems. Specifically, try to find peaks in evaluation times. Are they all generated by the audit scanner?

NOTE: If your organization has restrictions on sharing certain data about your environment, you can reach out to us on Slack to discuss it privately. We've done this in the past and are happy to gather that information from you. ;)

@jvanz jvanz moved this from In Progress to Blocked in Kubewarden Feb 26, 2025
@mueller-ma
Author

I'm assuming you have a large number of resources under evaluation.

Actually it's quite the opposite. The issue from #73 (comment) happened on a cluster where only two namespaces are monitored by Kubewarden policies: the default namespace and a custom one. The catch-all is just an nginx serving one very small static page.
Apart from that, the cluster hosts the Rancher UI, but in ignored namespaces.

$ k get all -n default
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   <some ip>    <none>        443/TCP   369d
$ k get all -n catch-all
NAME                                    READY   STATUS    RESTARTS   AGE
pod/catch-all-webapp-7dfd498885-44bvt   1/1     Running   0          47h
pod/catch-all-webapp-7dfd498885-ts4qr   1/1     Running   0          47h

NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/catch-all-webapp   ClusterIP   10.43.63.143   <none>        8080/TCP   232d

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/catch-all-webapp   2/2     2            2           232d

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/catch-all-webapp-7dfd498885   2         2         2       13d

The audit scanner has configurations to control how many requests it can send to the policy server.

These are the values for kubewarden-controller:

nodeSelector:
  node-role.kubernetes.io/control-plane: "true"
replicas: 2
auditScanner:
  policyReporter: true
policy-reporter:
  ui:
    enabled: true
    ingress:
      enabled: true
      hosts:
        - host: "example.com"
          paths:
            - path: /
              pathType: ImplementationSpecific

There are 17 policies on the server:

$ k get clusteradmissionpolicies.policies.kubewarden.io --no-headers  | wc -l
17
$ k get admissionpolicies.policies.kubewarden.io
No resources found in default namespace.

@jvanz jvanz moved this from Blocked to In Progress in Kubewarden Feb 27, 2025
@jvanz
Member

jvanz commented Feb 27, 2025

Ok, I think I'm now able to simulate the issue. I've tested with a Minikube cluster configured with 2 CPUs and 2GB of memory. In this cluster, I've installed 14 policies: all the recommended policies, some safe-labels and safe-annotations, and 5 instances of the env-var-secret-scanner policy.

NAME                                                POLICY SERVER   MUTATING   BACKGROUNDAUDIT   MODE      OBSERVED MODE   STATUS   AGE
do-not-run-as-root                                  default         true       true              monitor   monitor         active   9m2s
do-not-share-host-paths                             default         false      true              monitor   monitor         active   9m2s
drop-capabilities                                   default         true       true              monitor   monitor         active   9m2s
env-variable-secrets-scanner-17406943388196436491   default         false      true              protect   protect         active   8m59s
env-variable-secrets-scanner-17406943391899924692   default         false      true              protect   protect         active   8m59s
env-variable-secrets-scanner-17406943395718477223   default         false      true              protect   protect         active   8m59s
environment-variable-policy                         default         false      true              protect   protect         active   8m58s
no-host-namespace-sharing                           default         false      true              monitor   monitor         active   9m2s
no-privilege-escalation                             default         true       true              monitor   monitor         active   9m2s
no-privileged-pod                                   default         false      true              monitor   monitor         active   9m2s
safe-annotations                                    default         false      false             protect   protect         active   8m58s
safe-labels                                         default         false      false             protect   protect         active   8m58s
safe-labels-17406943399417047381                    default         false      false             protect   protect         active   8m58s
safe-labels-17406943403317894182                    default         false      false             protect   protect         active   8m58s

I'm observing errors similar to this when the audit scanner starts running:

2025-02-27T22:19:22.811847Z ERROR request{method=POST uri=/audit/clusterwide-env-variable-secrets-scanner-17406943395718477223 version=HTTP/1.1}:audit{host="policy-server-default-55d4554d4c-2dxmp" policy_id="clusterwide-env-variable-secrets-scanner-17406943395718477223" kind="Pod" kind_group="" kind_version="v1" name="prometheus-kube-state-metrics-898dd9b88-r6279" namespace="prometheus" operation="CREATE" request_uid="e4141381-abd5-4d15-95f7-e0894678fee9" resource="Pod" resource_group="" resource_version="v1" subresource=""}:validate{self=PolicyEvaluator { runtime: "wapc" } settings={}}: wasmtime_provider::provider: Failure invoking guest module handler: error while executing at wasm backtrace:
    0: 0x992a4 - <unknown>!dlfree
    1: 0x1821a6 - <unknown>!free
    2: 0x18218f - <unknown>!__rdl_dealloc
    3: 0x180cbb - <unknown>!__rust_dealloc
    4: 0x1464a5 - <unknown>!core::ptr::drop_in_place<regex::dfa::Cache>::hd4b5731558448f42
    5: 0x12d116 - <unknown>!core::mem::maybe_uninit::MaybeUninit<T>::assume_init_drop::hec6f3fa93b029b65
    6: 0xfa4f3 - <unknown>!<alloc::collections::btree::map::BTreeMap<K,V,A> as core::ops::drop::Drop>::drop::h87c74344a1a42e17
    7: 0x1de08 - <unknown>!env_variable_secrets_scanner_policy::validate::h35bbb91c592e1e61
    8: 0xb6961 - <unknown>!__guest_call
    9: 0x180de5 - <unknown>!__guest_call.command_export

Caused by:
    wasm trap: interrupt    
2025-02-27T22:19:22.813412Z ERROR request{method=POST uri=/audit/clusterwide-env-variable-secrets-scanner-17406943395718477223 version=HTTP/1.1}:audit{host="policy-server-default-55d4554d4c-2dxmp" policy_id="clusterwide-env-variable-secrets-scanner-17406943395718477223" kind="Pod" kind_group="" kind_version="v1" name="prometheus-grafana-c6df698db-n8cwn" namespace="prometheus" operation="CREATE" request_uid="8ac743df-6aab-467f-a189-94c66bd5283b" resource="Pod" resource_group="" resource_version="v1" subresource=""}:validate{self=PolicyEvaluator { runtime: "wapc" } settings={}}: wasmtime_provider::provider: Failure invoking guest module handler: error while executing at wasm backtrace:
    0: 0xcbbc7 - <unknown>!regex::compile::CompileClass::c_utf8_seq::h0efed6729a221bff
    1: 0x8129e - <unknown>!regex::compile::Compiler::c_class::h504ad9586ea3bda5
    2: 0x39884 - <unknown>!regex::compile::Compiler::c::h5554e2efe66eb004
    3: 0x1244fa - <unknown>!regex::compile::Compiler::c_concat::hc4569983351ade3e
    4: 0xaca94 - <unknown>!regex::compile::Compiler::c_repeat_range::h4257fe25849f4e12
    5: 0x38dfd - <unknown>!regex::compile::Compiler::c::h5554e2efe66eb004
    6: 0x39bf4 - <unknown>!regex::compile::Compiler::c::h5554e2efe66eb004
    7: 0x121bc3 - <unknown>!regex::compile::Compiler::c_capture::h921bae3d268d3d19
    8: 0x5f087 - <unknown>!regex::compile::Compiler::compile::hf6cf9f1a4c0aface
    9: 0x1379e - <unknown>!regex::exec::ExecBuilder::build::hc07d47f9d2ff6bc2
   10: 0x166130 - <unknown>!regex::re_builder::bytes::RegexBuilder::build::hb617485624f561ea
   11: 0xbecfa - <unknown>!core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once::hd5a10fc15d6b7ede
   12: 0xc4826 - <unknown>!<alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter::h7fe9835ebb992d48
   13: 0x146cfa - <unknown>!<alloc::collections::btree::map::BTreeMap<K,V> as core::iter::traits::collect::FromIterator<(K,V)>>::from_iter::h7ec2ffb3745c74c9
   14: 0x844ad - <unknown>!rusty_hog_scanner::SecretScannerBuilder::build::hf7505d7473b229b9
   15: 0x1ca63 - <unknown>!env_variable_secrets_scanner_policy::validate::h35bbb91c592e1e61
   16: 0xb6961 - <unknown>!__guest_call
   17: 0x180de5 - <unknown>!__guest_call.command_export

Caused by:
    wasm trap: interrupt    

2025-02-27T22:20:04.758916Z ERROR request{method=POST uri=/audit/clusterwide-env-variable-secrets-scanner-17406943395718477223 version=HTTP/1.1}:audit{host="policy-server-default-55d4554d4c-2dxmp" policy_id="clusterwide-env-variable-secrets-scanner-17406943395718477223" kind="Pod" kind_group="" kind_version="v1" name="my-replicaset-17406943380449525852-qc89s" namespace="default" operation="CREATE" request_uid="2a8b69f6-ccb6-41c6-b765-10320c26b9aa" resource="Pod" resource_group="" resource_version="v1" subresource=""}:validate{self=PolicyEvaluator { runtime: "wapc" } settings={}}: wasmtime_provider::provider: Failure invoking guest module handler: error while executing at wasm backtrace:
    0: 0x14f63e - <unknown>!core::ptr::drop_in_place<regex::prog::Program>::hcb5fbe39442d555e
    1: 0x16cb15 - <unknown>!alloc::sync::Arc<T>::drop_slow::ha8381feac729232b
    2: 0x17fdba - <unknown>!core::ptr::drop_in_place<alloc::sync::Arc<regex::exec::ExecReadOnly>>::hc1d3709ac4c32840
    3: 0x12cfa9 - <unknown>!core::mem::maybe_uninit::MaybeUninit<T>::assume_init_drop::hec6f3fa93b029b65
    4: 0xfa4f3 - <unknown>!<alloc::collections::btree::map::BTreeMap<K,V,A> as core::ops::drop::Drop>::drop::h87c74344a1a42e17
    5: 0x1de08 - <unknown>!env_variable_secrets_scanner_policy::validate::h35bbb91c592e1e61
    6: 0xb6961 - <unknown>!__guest_call
    7: 0x180de5 - <unknown>!__guest_call.command_export

Caused by:
    wasm trap: interrupt    

First, can you confirm that this is a similar error to what you're seeing in your environment? Does the stack trace show a similar path?

My hypothesis is that the rusty-hog library, which we use to scan for secrets, is not very fast. This makes this particular policy a bit slower, which can cause issues in scenarios with many requests (such as those from the audit scanner) and limited resources.

Therefore, a quick workaround is to increase the timeout configuration. If you don't want to do that for other policies, you could consider moving this specific policy to a dedicated policy server with a larger timeout. However, this would add another workload in a resource-constrained environment. Another option is to configure the audit scanner to be less aggressive with the policy server. This will take longer to run, but considering you have limited resources, I don't think it will be a major problem. In my example, I've reduced the auditScanner.parallelResources from 100 to 50.
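For reference, a minimal sketch of how that tuning might look in the kubewarden-controller Helm values, assuming the chart exposes the knob under auditScanner as in the values shared earlier in this thread (only the parallelResources change comes from this comment):

# Excerpt of kubewarden-controller Helm values (sketch).
auditScanner:
  policyReporter: true
  # Limit how many resources the audit scanner evaluates in parallel;
  # reduced from the default of 100 to 50, as described above.
  parallelResources: 50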

Meanwhile, I'll take this opportunity to update all the dependencies of this policy, as they are out of date. I don't believe this will make a significant difference for this issue, but I'll also examine the stack traces to see if we can make any improvements.

@mueller-ma
Author

How can I see that stack trace? So far I have only been looking at the policy reporter UI. I reduced parallelResources to 50 and will check if the issue happens again.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kubewarden Feb 28, 2025
@jvanz
Member

jvanz commented Feb 28, 2025

I think this issue has been closed by accident. Reopening it.

@jvanz jvanz reopened this Feb 28, 2025
@jvanz jvanz moved this from Done to In Progress in Kubewarden Feb 28, 2025
@jvanz
Member

jvanz commented Feb 28, 2025

How can I see that stack trace? So far I have only been looking at the policy reporter UI. I reduced parallelResources to 50 and will check if the issue happens again.

In the policy server logs.

@jvanz jvanz moved this from In Progress to Blocked in Kubewarden Feb 28, 2025
@jvanz
Member

jvanz commented Mar 10, 2025

I'm taking this issue as an opportunity to also update the policy and release the dependency updates.

#79

@jvanz
Member

jvanz commented Mar 11, 2025

@mueller-ma do you have any feedback on whether reducing parallelResources made a difference?

@mueller-ma
Author

So far there has been no issue since I reduced parallelResources. I guess this issue can be closed.

@jvanz
Member

jvanz commented Mar 12, 2025

That's great! Thanks for your patience and help testing everything. :)
