CI: Leftover resources when EKS cluster creation fails unexpectedly #1173

Closed
orfeas-k opened this issue Dec 3, 2024 · 1 comment · Fixed by #1174
Labels
bug Something isn't working

Comments

@orfeas-k
Contributor

orfeas-k commented Dec 3, 2024

Bug Description

As observed in https://github.com/canonical/bundle-kubeflow/actions/runs/12130614512/job/33821313708#step:11:42, the cluster creation failed due to #1171, but the CloudFormation stack was created and never deleted (under normal circumstances it is deleted by eksctl delete cluster).
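
A possible mitigation (not necessarily the approach taken in #1174) is an always-run cleanup step in the CI job that deletes the cluster and, as a fallback, the CloudFormation stack that eksctl creates. A minimal bash sketch, assuming the cluster name and region from the logs below; the explicit delete-stack call only covers the case where creation failed before the cluster itself existed:

```bash
#!/usr/bin/env bash
# Hypothetical cleanup step (e.g. run with `if: always()` in the workflow),
# so it executes even when `eksctl create cluster` exits unexpectedly.
set -euo pipefail

CLUSTER_NAME="kubeflow-test-latest"   # assumed: matches the CI cluster name
REGION="eu-central-1"                 # assumed: matches the CI region

# Normal path: eksctl deletes the cluster and the CloudFormation stacks it owns.
eksctl delete cluster --region="${REGION}" --name="${CLUSTER_NAME}" || true

# Fallback: if creation failed before the cluster existed, the cluster stack
# can still be left behind, so delete it explicitly.
aws cloudformation delete-stack \
  --region "${REGION}" \
  --stack-name "eksctl-${CLUSTER_NAME}-cluster" || true
```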

To Reproduce

Create a cluster and have the creation exit unexpectedly.

Environment

EKS 1.29

Relevant Log Output

The following logs are from the run in which the issue occurred:

2024-12-03 00:39:55 [ℹ]  eksctl version 0.196.0
2024-12-03 00:39:55 [ℹ]  using region eu-central-1
2024-12-03 00:39:55 [ℹ]  subnets for eu-central-1a - public:192.168.0.0/19 private:192.168.64.0/19
2024-12-03 00:39:55 [ℹ]  subnets for eu-central-1b - public:192.168.32.0/19 private:192.168.96.0/19
2024-12-03 00:39:55 [ℹ]  nodegroup "ng-d06bd84e" will use "ami-015db95d8173273e9" [Ubuntu2004/1.29]
2024-12-03 00:39:55 [ℹ]  using Kubernetes version 1.29
2024-12-03 00:39:55 [ℹ]  creating EKS cluster "kubeflow-test-latest" in "eu-central-1" region with managed nodes
2024-12-03 00:39:55 [ℹ]  1 nodegroup (ng-d06bd84e) was included (based on the include/exclude rules)
2024-12-03 00:39:55 [ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2024-12-03 00:39:55 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 00:39:55 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 00:39:55 [ℹ]  CloudWatch logging will not be enabled for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 00:39:55 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 00:39:55 [ℹ]  default addons coredns, vpc-cni, kube-proxy were not specified, will install them as EKS addons
2024-12-03 00:39:55 [ℹ]  
2 sequential tasks: { create cluster control plane "kubeflow-test-latest", 
    2 sequential sub-tasks: { 
        2 sequential sub-tasks: { 
            1 task: { create addons },
            wait for control plane to become ready,
        },
        create managed nodegroup "ng-d06bd84e",
    } 
}
2024-12-03 00:39:55 [ℹ]  building cluster stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:39:56 [ℹ]  deploying stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:26 [ℹ]  waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [✖]  unexpected status "ROLLBACK_IN_PROGRESS" while waiting for CloudFormation stack "eksctl-kubeflow-test-latest-cluster"
2024-12-03 00:40:27 [ℹ]  fetching stack events in attempt to troubleshoot the root cause of the failure
2024-12-03 00:40:27 [!]  AWS::EC2::EIP/NATIP: DELETE_IN_PROGRESS
Error: failed to create cluster "kubeflow-test-latest"
2024-12-03 00:40:27 [!]  AWS::IAM::Role/ServiceRole: DELETE_IN_PROGRESS
2024-12-03 00:40:27 [✖]  AWS::IAM::Role/ServiceRole: CREATE_FAILED – "Resource creation cancelled"
2024-12-03 00:40:27 [✖]  AWS::EC2::EIP/NATIP: CREATE_FAILED – "Resource creation cancelled"
2024-12-03 00:40:27 [✖]  AWS::EC2::InternetGateway/InternetGateway: CREATE_FAILED – "Resource handler returned message: \"The maximum number of internet gateways has been reached. (Service: Ec2, Status Code: 400, Request ID: a97f20de-a1fa-4fd2-8a2f-f83ef2ccfaf9)\" (RequestToken: 933fc990-1bed-f543-4a69-ac24808072f5, HandlerErrorCode: ServiceLimitExceeded)"
2024-12-03 00:40:27 [✖]  AWS::EC2::VPC/VPC: CREATE_FAILED – "Resource handler returned message: \"The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: d75bf6ea-669a-4723-b056-cfa10de61ad8)\" (RequestToken: 99f79551-eabf-93fe-f5f8-4a65e599edb6, HandlerErrorCode: GeneralServiceException)"
2024-12-03 00:40:27 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2024-12-03 00:40:27 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=eu-central-1 --name=kubeflow-test-latest'
2024-12-03 00:40:27 [✖]  ResourceNotReady: failed waiting for successful resource state

The following logs are from rerunning the job (when the CloudFormation stack already exists):

2024-12-03 09:38:09 [ℹ]  eksctl version 0.197.0
2024-12-03 09:38:09 [ℹ]  using region eu-central-1
2024-12-03 09:38:09 [ℹ]  subnets for eu-central-1a - public:192.168.0.0/19 private:192.168.64.0/19
2024-12-03 09:38:09 [ℹ]  subnets for eu-central-1b - public:192.168.32.0/19 private:192.168.96.0/19
2024-12-03 09:38:09 [ℹ]  nodegroup "ng-d06bd84e" will use "ami-015db95d8173273e9" [Ubuntu2004/1.29]
2024-12-03 09:38:10 [ℹ]  using Kubernetes version 1.29
2024-12-03 09:38:10 [ℹ]  creating EKS cluster "kubeflow-test-latest" in "eu-central-1" region with managed nodes
2024-12-03 09:38:10 [ℹ]  1 nodegroup (ng-d06bd84e) was included (based on the include/exclude rules)
2024-12-03 09:38:10 [ℹ]  will create a CloudFormation stack for cluster itself and 1 managed nodegroup stack(s)
2024-12-03 09:38:10 [ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 09:38:10 [ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 09:38:10 [ℹ]  CloudWatch logging will not be enabled for cluster "kubeflow-test-latest" in "eu-central-1"
2024-12-03 09:38:10 [ℹ]  you can enable it with 'eksctl utils update-cluster-logging --enable-types={SPECIFY-YOUR-LOG-TYPES-HERE (e.g. all)} --region=eu-central-1 --cluster=kubeflow-test-latest'
2024-12-03 09:38:10 [ℹ]  default addons vpc-cni, kube-proxy, coredns were not specified, will install them as EKS addons
2024-12-03 09:38:10 [ℹ]  
2 sequential tasks: { create cluster control plane "kubeflow-test-latest", 
    2 sequential sub-tasks: { 
        2 sequential sub-tasks: { 
            1 task: { create addons },
            wait for control plane to become ready,
        },
        create managed nodegroup "ng-d06bd84e",
    } 
}
2024-12-03 09:38:10 [ℹ]  building cluster stack "eksctl-kubeflow-test-latest-cluster"
Error: failed to create cluster "kubeflow-test-latest"
2024-12-03 09:38:10 [!]  1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2024-12-03 09:38:10 [ℹ]  to cleanup resources, run 'eksctl delete cluster --region=eu-central-1 --name=kubeflow-test-latest'
2024-12-03 09:38:10 [✖]  creating CloudFormation stack "eksctl-kubeflow-test-latest-cluster": operation error CloudFormation: CreateStack, https response error StatusCode: 400, RequestID: 2a8e4224-5f7b-4f7b-881b-a99e0e15d5b0, AlreadyExistsException: Stack [eksctl-kubeflow-test-latest-cluster] already exists
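
The rerun fails because the leftover stack from the first attempt still exists. A hedged sketch of a pre-flight check that could run before retrying, using the eksctl-<cluster>-cluster stack name shown in the logs (again an illustration, not necessarily the fix from #1174):

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check before re-running cluster creation:
# if a leftover eksctl cluster stack exists, delete it and wait for deletion.
set -euo pipefail

STACK_NAME="eksctl-kubeflow-test-latest-cluster"  # assumed stack name
REGION="eu-central-1"

if aws cloudformation describe-stacks --region "${REGION}" --stack-name "${STACK_NAME}" >/dev/null 2>&1; then
  echo "Leftover stack ${STACK_NAME} found; deleting it before retrying"
  aws cloudformation delete-stack --region "${REGION}" --stack-name "${STACK_NAME}"
  aws cloudformation wait stack-delete-complete --region "${REGION}" --stack-name "${STACK_NAME}"
fi
```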

Additional Context

No response

@orfeas-k orfeas-k added the bug Something isn't working label Dec 3, 2024

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6637.

This message was autogenerated
