[Bug]: Cascade Deletion of S3 Bucket may cause K8s API DoS #1488

Open

creativeChips opened this issue Sep 10, 2024 · 0 comments

Labels: bug (Something isn't working), needs:triage
creativeChips commented Sep 10, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Affected Resource(s)

s3.aws.upbound.io/Bucket/v1beta1 - Bucket

Resource MRs required to reproduce the bug

apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: ...
spec:
  resources:
    - name: bucket
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: Bucket
        spec:
          deletionPolicy: Delete
          providerConfigRef:
            name: aws-provider
          forProvider:
            region: us-east-1
            forceDestroy: true
    - name: bucket-versioning
      base:
        apiVersion: s3.aws.upbound.io/v1beta1
        kind: BucketVersioning
        spec:
          providerConfigRef:
            name: aws-provider
          forProvider:
            region: us-east-1
            bucketSelector:
              matchControllerRef: true
            versioningConfiguration:
              - status: Enabled

Steps to Reproduce

  1. Define a Crossplane role on AWS which cannot delete bucket data (specifically, lacking the s3:DeleteObjectVersion permission).
  2. Deploy an AWS S3 bucket, using Crossplane, with:
    2.1. versioning enabled
    2.2. forceDestroy set to true
    2.3. deletionPolicy not set to Orphan
  3. Fill the bucket with data.
  4. Delete the managed bucket from K8s (a reproduction sketch follows after this list).
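
A minimal reproduction sketch, assuming the Composition above is installed, the resulting Bucket MR is named example-bucket (a hypothetical name), and the AWS CLI runs with credentials that do have write access to the bucket:

# 3. Fill the versioned bucket with a few objects (the more objects and versions, the worse the churn).
for i in $(seq 1 100); do
  echo "payload $i" | aws s3 cp - s3://example-bucket/object-$i
done

# 4. Delete the managed resource; with deletionPolicy: Delete and forceDestroy: true the provider
#    first tries to purge the object versions and starts failing on s3:DeleteObjectVersion.
kubectl delete buckets.s3.aws.upbound.io example-bucket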

Deleting the Bucket MR from Crossplane requires it to issue a delete request for the external resource (ER) on AWS. Because forceDestroy is set to true, the provider first attempts to delete the stored objects. Since it lacks the permission to delete object versions, it enters rapid fail-retry attempts; each failure updates the MR, which fills up the etcd mvcc database and causes a DoS of the K8s API (Error from server: etcdserver: mvcc: database space exceeded).
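
An illustrative way to observe the churn while it is happening (example-bucket is again a hypothetical MR name): every failed delete attempt rewrites the MR's status, so its resourceVersion keeps climbing.

# Hypothetical MR name; each failed delete attempt bumps metadata.resourceVersion.
while true; do
  kubectl get buckets.s3.aws.upbound.io example-bucket \
    -o jsonpath='{.metadata.resourceVersion}{"\n"}'
  sleep 5
done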

What happened?

Our use case involved setting up an S3 bucket, along with a BucketPolicy, BucketVersioning, ... in a Composition.
We had tens of managed buckets, and the external buckets held millions of objects.
We ran into a situation where our policy denied the Crossplane role itself. In parallel to this, the manifests creating the bucket resources had been (mis)configured to delete the external resource upon deletion of the managed resource, and to destroy the data.
When the manifests were removed from the code base (in an interim commit while working on a long-term fix for our issue), the GitOps tool (ArgoCD) issued a delete ("prune") for the managed resources.
These deletion requests were denied by AWS, as the policy denied the Crossplane role itself.
When the role was re-enabled (with read-only permissions, no write/delete), Crossplane (via its finalizer) resumed control and tried to proceed with the deletion, attempting to delete the objects within the external buckets to comply with its forceDestroy configuration.
This depleted the space of the etcd database and caused an outage for K8s (the API became unresponsive).

Investigating this after the recovery, we found that Crossplane stores the error from the AWS SDK on the bucket object and updates the object with the new error each time the action fails, which is often, because it keeps retrying. Each update creates a new version of the resource in etcd, which uses space until the database is compacted and defragged.

We believe the retry logic - in general perhaps, and for this case of a client-side permissions error specifically - should be revisited and a suitable back-off introduced, to avoid flooding the etcd database.

We will be happy to provide more information if needed.

Relevant Error Output Snippet

  • While issuing API requests to K8s we received (e.g. kubectl get pods):
Error from server: etcdserver: mvcc: database space exceeded.
  • The AWS provider log had many of these errors:
 AccessDenied: User: arn:aws:sts::<account id>:assumed-role/<crossplane role>/<session> is not authorized to perform: s3:DeleteObjectVersion on resource: \"arn:aws:s3:::<bucket name>/<object path>\" because no identity-based policy allows the s3:DeleteObjectVersion action  []}]", ...
  • Looking into an etcd snapshot, for one example resource, we can see that Crossplane changed the resource hundreds of thousands of times since creation, and thus generated as many versions of the full object in etcd (a further check is sketched after this output):
etcdctl get /registry/s3.aws.upbound.io/bucketversionings/<bucket name> -w=json | jq '.kvs[] | .version'
533825
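
As a related, hedged check (assuming etcdctl v3 with access to the cluster's endpoints and certificates), the quota alarm and database size can be confirmed directly:

# Lists the NOSPACE alarm that is raised once the space quota is exceeded.
etcdctl alarm list

# Shows the database size per endpoint (DB SIZE column).
etcdctl endpoint status --write-out=table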

Crossplane Version

1.16.0

Provider Version

1.9.1

Kubernetes Version

1.24.17

Kubernetes Distribution

No response

Additional Info

We were able to recover from it by:

  1. Compacting and defragging etcd, to regain access to the API
  2. Deleting the entire Crossplane system to stop the flood and then forcefully cleaning up the remaining MRs, or
    annotating the remaining objects with crossplane.io/paused: "true" and patching the MRs with deletionPolicy: Orphan (or the equivalent managementPolicies) and forceDestroy: false
  3. Bringing Crossplane back, or removing the crossplane.io/paused annotation, which completes the in-progress deletion without impacting the external resources
  4. Re-deploying the MRs with a suitable crossplane.io/external-name annotation, which made Crossplane "import" the ER and get back on track (illustrative commands for steps 2-4 are sketched after this list)
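
A hedged sketch of steps 2-4, assuming kubectl access and an affected MR named example-bucket backed by an external bucket of the same name (both names are hypothetical); the crossplane.io/paused annotation, deletionPolicy, forProvider.forceDestroy, and crossplane.io/external-name fields are standard, the rest is illustrative:

# 2. Pause reconciliation and defuse the destructive settings on the stuck MR (hypothetical name).
kubectl annotate buckets.s3.aws.upbound.io example-bucket crossplane.io/paused=true --overwrite
kubectl patch buckets.s3.aws.upbound.io example-bucket --type merge \
  -p '{"spec":{"deletionPolicy":"Orphan","forProvider":{"forceDestroy":false}}}'

# 3. Resume reconciliation; the pending deletion then completes without touching the external bucket.
kubectl annotate buckets.s3.aws.upbound.io example-bucket crossplane.io/paused-

# 4. When re-deploying the MR, set the external name in the manifest so Crossplane imports the
#    existing bucket instead of creating a new one:
#      metadata:
#        annotations:
#          crossplane.io/external-name: example-bucket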