Is there an existing issue for this?
Affected Resource(s)
s3.aws.upbound.io/Bucket/v1beta1 - Bucket
Resource MRs required to reproduce the bug
Steps to Reproduce
1. Define a Crossplane role on AWS that cannot delete bucket data (specifically, it lacks the s3:DeleteObjectVersion permission).
2. Deploy an AWS S3 Bucket, using Crossplane, with:
   2.1. versioning enabled
   2.2. forceDestroy set to true
   2.3. Orphan not allowed (i.e. deletionPolicy: Delete; see the manifest sketch below)
3. Fill the bucket with data.
4. Delete the managed Bucket from Kubernetes.

Deleting the Bucket MR from Crossplane requires it to issue a delete request to the ER on AWS. Because forceDestroy is set to true, it attempts to delete the stored objects first. As it lacks the permission to delete object versions, it enters a rapid fail-retry loop that fills up the etcd MVCC database and causes a DoS of the Kubernetes API (Error from server: etcdserver: mvcc: database space exceeded).
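For reference, a minimal manifest sketch matching these steps, assuming the provider-upjet-aws v1beta1 APIs; the bucket name, region, and provider config name are placeholders, not taken from our environment:

```yaml
# Sketch only: a Bucket that Crossplane may delete (no Orphan), and which is
# configured to also destroy its objects on deletion (forceDestroy).
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: repro-bucket              # placeholder name
spec:
  deletionPolicy: Delete          # "Orphan not allowed"
  forProvider:
    region: eu-west-1             # placeholder region
    forceDestroy: true            # delete all objects (and versions) on destroy
  providerConfigRef:
    name: default                 # placeholder provider config
---
# Versioning is a separate MR in the upjet-based provider.
apiVersion: s3.aws.upbound.io/v1beta1
kind: BucketVersioning
metadata:
  name: repro-bucket-versioning
spec:
  forProvider:
    region: eu-west-1
    bucketRef:
      name: repro-bucket
    versioningConfiguration:
      - status: Enabled
  providerConfigRef:
    name: default
```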
What happened?
Our use case involved setting up an S3 bucket, along with a BucketPolicy, BucketVersioning, etc., in a Composition.
We had tens of managed buckets, and the external buckets held millions of objects.
We ran into a situation where our policy denied the Crossplane role itself. In parallel, the manifests creating the bucket resources had been (mis)configured to delete the external resource upon deletion of the managed resource, and to destroy the data.
When the manifests were removed from the code base (in an interim commit while working on a long-term fix for our issue), GitOps (ArgoCD) issued a delete ("prune") for the managed resources.
These delete requests were denied by AWS, because the policy denied the Crossplane role itself.
When the role was enabled again (with read-only permissions, no write/delete), the Crossplane finalizer resumed control and tried to proceed with the deletion, deleting the objects within the external buckets to comply with its forceDestroy configuration.
This resulted in exhaustion of the etcd database space and an outage of Kubernetes (the API became unresponsive).
Investigating after the recovery, we found that Crossplane stores the error from the AWS SDK on the Bucket object and updates the object with the new error each time the action fails, which is often, because it keeps retrying. Each update creates a new revision in etcd, which consumes space until the database is compacted and defragmented (see the illustrative condition below).
We believe the retry logic, perhaps in general, and specifically in this case of a client-side permissions error, should be revisited and a suitable back-off introduced, to avoid flooding the etcd database.
We will be happy to provide more information if needed.
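To illustrate the mechanism, a hedged reconstruction (not a condition captured from our cluster; exact reason/message wording may differ by provider version): each failed delete attempt rewrites a status condition on the Bucket MR along these lines, and each such write lands as a new revision of the full object in etcd.

```yaml
# Illustrative only: the error reported by the AWS SDK is written back into
# the MR's status on every failed reconcile attempt.
status:
  conditions:
    - type: Synced
      status: "False"
      reason: ReconcileError
      lastTransitionTime: "2024-01-01T00:00:00Z"   # placeholder
      message: >-
        ... AccessDenied: User: arn:aws:sts::<account id>:assumed-role/<crossplane role>/<session>
        is not authorized to perform: s3:DeleteObjectVersion on resource:
        "arn:aws:s3:::<bucket name>/<object path>" ...
```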
Relevant Error Output Snippet
While issuing API requests to K8s (e.g. kubectl get pods), we received:
Error from server: etcdserver: mvcc: database space exceeded
The AWS provider log had many of these errors:
AccessDenied: User: arn:aws:sts::<account id>:assumed-role/<crossplane role>/<session> is not authorized to perform: s3:DeleteObjectVersion on resource: \"arn:aws:s3:::<bucket name>/<object path>\" because no identity-based policy allows the s3:DeleteObjectVersion action []}]", ...
Looking into an etcd snapshot, for one example resource, we can see that Crossplane had changed the resource hundreds of thousands of times since its creation, and thus generated as many versions of the full object in etcd.
Crossplane Version
1.16.0
Provider Version
1.9.1
Kubernetes Version
1.24.17
Kubernetes Distribution
No response
Additional Info
We were able to recover from it by:
- Deleting the entire Crossplane system to stop the flood, then forcefully cleaning up the remaining MRs, or
- Annotating the remaining objects with the crossplane.io/paused: "true" annotation, then patching the MRs with deletionPolicy: Orphan (or the equivalent managementPolicies) and forceDestroy: false (see the sketch below)
- Bringing Crossplane back, or removing the crossplane.io/paused annotation, which completes the deletion in-process without impacting the external resources
- Deploying the MRs again with a suitable crossplane.io/external-name annotation, which made Crossplane "import" the ER and get "back on track"
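A minimal sketch of the recovery steps, assuming the same v1beta1 Bucket kind as above (resource names, region, and bucket name are placeholders):

```yaml
# 1. Pause reconciliation and make deletion safe: orphan the external
#    resource and stop trying to destroy its objects.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: repro-bucket                      # placeholder
  annotations:
    crossplane.io/paused: "true"          # stop the retry flood first
spec:
  deletionPolicy: Orphan                  # leave the external bucket untouched
  forProvider:
    forceDestroy: false                   # never attempt object/version deletion
---
# 2. Later, re-create the MR pointing at the existing bucket so Crossplane
#    "imports" the external resource instead of creating a new one.
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: repro-bucket
  annotations:
    crossplane.io/external-name: my-existing-bucket   # placeholder bucket name
spec:
  deletionPolicy: Orphan
  forProvider:
    region: eu-west-1                     # placeholder region
```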