[Bug]: Cascade Deletion of S3 Bucket may cause K8s API DoS #1488

Open

creativeChips opened this issue Sep 10, 2024 · 0 comments

Labels: bug (Something isn't working), needs:triage
creativeChips commented Sep 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

s3.aws.upbound.io/Bucket/v1beta1 - Bucket

Resource MRs required to reproduce the bug

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: ...
spec:
  resources:
    - name: bucket
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: Bucket
        spec:
          deletionPolicy: Delete
          providerConfigRef:
            name: aws-provider
          forProvider:
            region: us-east-1
            forceDestroy: true
    - name: bucket-versioning
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: BucketVersioning
        spec:
          providerConfigRef:
            name: aws-provider
          forProvider:
            region: us-east-1
            bucketSelector:
              matchControllerRef: true
            versioningConfiguration:
              - status: Enabled

Steps to Reproduce

  1. Define a Crossplane role on AWS which cannot delete bucket data (specifically, lacking the s3:DeleteObjectVersion permission).
  2. Deploy an AWS S3 bucket, using Crossplane, with:
    2.1. versioning enabled
    2.2. forceDestroy set to true
    2.3. deletionPolicy not set to Orphan
  3. Fill the bucket with data.
  4. Delete the managed bucket from K8s (a reproduction sketch follows after this list).
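
A minimal reproduction sketch, assuming the Composition above is installed, the resulting Bucket MR is named example-bucket (a hypothetical name), and the AWS CLI runs with credentials that do have write access to the bucket:

# 3. Fill the versioned bucket with a few objects (the more objects and versions, the worse the churn).
for i in $(seq 1 100); do
  echo "payload $i" | aws s3 cp - s3://example-bucket/object-$i
done

# 4. Delete the managed resource; with deletionPolicy: Delete and forceDestroy: true the provider
#    first tries to purge the object versions and starts failing on s3:DeleteObjectVersion.
kubectl delete buckets.s3.aws.upbound.io example-bucket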

Deleting the Bucket MR from Crossplane requires it to issue a delete request for the external resource (ER) on AWS. Because forceDestroy is set to true, the provider first attempts to delete the stored objects. Since it lacks the permission to delete object versions, it enters rapid fail-retry attempts; each failure updates the MR, which fills up the etcd mvcc database and causes a DoS of the K8s API (Error from server: etcdserver: mvcc: database space exceeded).
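
An illustrative way to observe the churn while it is happening (example-bucket is again a hypothetical MR name): every failed delete attempt rewrites the MR's status, so its resourceVersion keeps climbing.

# Hypothetical MR name; each failed delete attempt bumps metadata.resourceVersion.
while true; do
  kubectl get buckets.s3.aws.upbound.io example-bucket \
    -o jsonpath='{.metadata.resourceVersion}{"\n"}'
  sleep 5
done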

What happened?

Our use case involved setting up an S3 bucket, along with a BucketPolicy, BucketVersioning, ... in a Composition.
We had tens of managed buckets, and the external buckets held millions of objects.
We ran into a situation where our policy denied the Crossplane role itself. In parallel to this, the manifests creating the bucket resources had been (mis)configured to delete the external resource upon deletion of the managed resource, and to destroy the data.
When the manifests were removed from the code base (in an interim commit while working on a long-term fix for our issue), the GitOps tool (ArgoCD) issued a delete ("prune") for the managed resources.
These deletion requests were denied by AWS, as the policy denied the Crossplane role itself.
When the role was re-enabled (with read-only permissions, no write/delete), Crossplane (via its finalizer) resumed control and tried to proceed with the deletion, attempting to delete the objects within the external buckets to comply with its forceDestroy configuration.
This depleted the space of the etcd database and caused an outage for K8s (the API became unresponsive).

Investigating this after the recovery, we found that Crossplane stores the error from the AWS SDK on the bucket object and updates the object with the new error each time the action fails, which is often, because it keeps retrying. Each update creates a new version of the resource in etcd, which uses space until the database is compacted and defragged.

We believe the retry logic - in general perhaps, and for this case of a client-side permissions error specifically - should be revisited and a suitable back-off introduced, to avoid flooding the etcd database.

We will be happy to provide more information if needed.

Relevant Error Output Snippet

  • While issuing API requests to K8s we received (e.g. kubectl get pods):
Error from server: etcdserver: mvcc: database space exceeded.
  • The AWS provider log had many of these errors:
 AccessDenied: User: arn:aws:sts::<account id>:assumed-role/<crossplane role>/<session> is not authorized to perform: s3:DeleteObjectVersion on resource: \"arn:aws:s3:::<bucket name>/<object path>\" because no identity-based policy allows the s3:DeleteObjectVersion action  []}]", ...
  • Looking into an etcd snapshot, for one example resource, we can see that Crossplane changed the resource hundreds of thousands of times since creation, and thus generated as many versions of the full object in etcd (a further check is sketched after this output):
etcdctl get /registry/s3.aws.upbound.io/bucketversionings/<bucket name> -w=json | jq '.kvs[] | .version'
533825
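
As a related, hedged check (assuming etcdctl v3 with access to the cluster's endpoints and certificates), the quota alarm and database size can be confirmed directly:

# Lists the NOSPACE alarm that is raised once the space quota is exceeded.
etcdctl alarm list

# Shows the database size per endpoint (DB SIZE column).
etcdctl endpoint status --write-out=table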

Crossplane Version

1.16.0

Provider Version

1.9.1

Kubernetes Version

1.24.17

Kubernetes Distribution

No response

Additional Info

We were able to recover from it by:

  1. Compacting and defragging etcd, to regain access to the API
  2. Deleting the entire Crossplane system to stop the flood and then forcefully cleaning up the remaining MRs, or
    annotating the remaining objects with crossplane.io/paused: "true" and patching the MRs with deletionPolicy: Orphan (or the equivalent managementPolicies) and forceDestroy: false
  3. Bringing Crossplane back, or removing the crossplane.io/paused annotation, which completes the in-progress deletion without impacting the external resources
  4. Re-deploying the MRs with a suitable crossplane.io/external-name annotation, which made Crossplane "import" the ER and get back on track (illustrative commands for steps 2-4 are sketched after this list)
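
A hedged sketch of steps 2-4, assuming kubectl access and an affected MR named example-bucket backed by an external bucket of the same name (both names are hypothetical); the crossplane.io/paused annotation, deletionPolicy, forProvider.forceDestroy, and crossplane.io/external-name fields are standard, the rest is illustrative:

# 2. Pause reconciliation and defuse the destructive settings on the stuck MR (hypothetical name).
kubectl annotate buckets.s3.aws.upbound.io example-bucket crossplane.io/paused=true --overwrite
kubectl patch buckets.s3.aws.upbound.io example-bucket --type merge \
  -p '{"spec":{"deletionPolicy":"Orphan","forProvider":{"forceDestroy":false}}}'

# 3. Resume reconciliation; the pending deletion then completes without touching the external bucket.
kubectl annotate buckets.s3.aws.upbound.io example-bucket crossplane.io/paused-

# 4. When re-deploying the MR, set the external name in the manifest so Crossplane imports the
#    existing bucket instead of creating a new one:
#      metadata:
#        annotations:
#          crossplane.io/external-name: example-bucket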