[ci.jenkins.io] Move ephemeral Linux containers to AWS #4317

Open
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 12 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

Requires #4319


github-actions bot commented Sep 28, 2024

Take a look at these similar issues to see if there isn't already a response to your problem:

  1. 76% [ci.jenkins.io] Move ephemeral VM agents to AWS #4316

@dduportal dduportal changed the title Move ephemeral Linux containers to AWS [ci.jenkins.io] Move ephemeral Linux containers to AWS Sep 28, 2024
@dduportal dduportal added the triage Incoming issues that need review label Oct 15, 2024
@dduportal dduportal added this to the infra-team-sync-2025-01-21 milestone Jan 15, 2025
@dduportal dduportal added ci.jenkins.io kubernetes aws and removed triage Incoming issues that need review labels Jan 15, 2025
@dduportal
Contributor Author

Update: see the first bullet of #4319 (comment)

  • Set up the Jenkins namespace + SVC account + RBAC, and set up the controller VM IAM identity to map to this SVC account in order to have a credential-less setup (as was done with the EC2 agents)

    • Don't forget to remove the kubeconfig's output from terraform once setup. Implies eventually updating the helm chart to NOT create a token (unneeded anymore)

=> all requirements from #4319 have been met (only leftover is related to the EC2 agents to ACP)

@smerle33 has already started working on this issue: #4319 (comment), to ensure that ci.jenkins.io can access the EKS cluster cijenkinsio-agents2 without credentials, using instance identity AND the proper IAM and RBAC permissions (requires a proper namespace and quotas):

@smerle33
Contributor

The AWS Jenkins controller can now connect to the cijenkinsio-agents2 Kubernetes cluster without a stored credential, using the AWS CLI and a .kube/config file that provides the cluster CA cert, the server URL and an aws call for the user.

We need to provide the AWS CLI and the .kube/config file in the controller container.
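
For readers unfamiliar with this kind of kubeconfig, here is a minimal sketch of what such a file could look like, generated from Python. It assumes the standard exec credential plugin mechanism with `aws eks get-token`; the server URL, CA data, region and user name below are placeholders, not the real values of the cluster. The output is JSON, which kubeconfig loaders normally accept since JSON is a subset of YAML.

```python
import json


def build_kubeconfig(cluster_name, server_url, ca_data_b64, region):
    """Return a kubeconfig dict whose user obtains a token from `aws eks get-token`."""
    # Hypothetical user name: any string works as long as the context references it.
    user_name = f"{cluster_name}-instance-identity"
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster_name,
            "cluster": {
                "server": server_url,
                "certificate-authority-data": ca_data_b64,
            },
        }],
        "users": [{
            "name": user_name,
            "user": {
                # No static token: kubectl (or any client-go based client) runs the
                # AWS CLI, which signs the request with the VM's IAM identity.
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "aws",
                    "args": [
                        "eks", "get-token",
                        "--cluster-name", cluster_name,
                        "--region", region,
                    ],
                },
            },
        }],
        "contexts": [{
            "name": cluster_name,
            "context": {"cluster": cluster_name, "user": user_name},
        }],
        "current-context": cluster_name,
    }


if __name__ == "__main__":
    config = build_kubeconfig(
        cluster_name="cijenkinsio-agents2",
        server_url="https://EXAMPLE.eks.amazonaws.com",  # placeholder endpoint
        ca_data_b64="BASE64_CA_PLACEHOLDER",             # placeholder CA bundle
        region="REGION_PLACEHOLDER",                     # placeholder region
    )
    print(json.dumps(config, indent=2))
```

This is why the AWS CLI has to be present in the controller container: the kubeconfig only describes how to call it, it does not carry any secret itself.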

@dduportal
Contributor Author

dduportal commented Jan 17, 2025

Cherry picking comment by @smerle33: #4319 (comment):

for part 1 (jenkins namespace, service account, rbac and iam link with VM iam identity and kubernetes service account):

namespace, service account and RBAC are handled by the Helm chart: jenkins-infra/helm-charts@main/charts/jenkins-kubernetes-agents

Following this step, we need to provide the AWS CLI within the Jenkins container and a specific kubeconfig to handle the credential-less authentication from ci.jenkins.io in AWS.

@dduportal dduportal self-assigned this Jan 19, 2025
@dduportal
Contributor Author

Update:

  • We need to set up the Karpenter Node Pools to handle the "normal" builds and the BOM build
  • I'm taking over the task above from Stephane as he's ill for 1 week (and must rest!)

@dduportal
Contributor Author

dduportal commented Jan 20, 2025

Update:

  • I'm taking over the task above from Stephane as he's ill for 1 week (and must rest!)

Next steps:

  • Set up the Karpenter Node groups and Node Class to support both clouds (normal linux amd64 and BOM linux amd64), using Spot instances by default (falling back to on-demand)
  • We have to fix the domain name setup of aws.ci.jenkins.io if using "Direct Connection" for Kubernetes pods: we might see pods scheduled and starting, but the agent will most probably fail to start due to a certificate issue. See [ci.jenkins.io] Move controller (VM) to AWS #4315 (comment)
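
To illustrate the "Spot by default, on-demand as fallback" idea from the first bullet, here is a rough sketch of a Karpenter NodePool manifest built in Python. It is not the actual Terraform code of the project: the pool names, the node class name and the `jenkins` label key are hypothetical. It relies on Karpenter's documented behaviour of preferring Spot when a pool allows both capacity types and using on-demand when no Spot capacity is available.

```python
import json


def build_node_pool(name, node_class_name, agent_label):
    """Return a NodePool manifest allowing Spot first, then on-demand."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                # Hypothetical label key used to steer pods to the right pool.
                "metadata": {"labels": {"jenkins": agent_label}},
                "spec": {
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": node_class_name,
                    },
                    "requirements": [
                        {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
                        {"key": "kubernetes.io/os", "operator": "In", "values": ["linux"]},
                        # Listing both capacity types lets Karpenter pick Spot
                        # when available and on-demand otherwise.
                        {
                            "key": "karpenter.sh/capacity-type",
                            "operator": "In",
                            "values": ["spot", "on-demand"],
                        },
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Two pools, one per Jenkins cloud (names are illustrative only).
    for pool_name, label in (("agents-linux-amd64", "default"), ("agents-bom-linux-amd64", "bom")):
        print(json.dumps(build_node_pool(pool_name, "example-node-class", label), indent=2))
```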

@dduportal
Contributor Author

  • Set up the Karpenter Node groups and Node Class to support both clouds (normal linux amd64 and BOM linux amd64), using Spot instances by default (falling back to on-demand)

Update: jenkins-infra/terraform-aws-sponsorship#105 has been deployed.

Tested with the test job: the Node labels maven-21 and maven-bom both schedule pods and nodes on the correct node pools successfully.

Next step: moving effort to as the agents in the pod are failing to start.

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

dduportal commented Jan 24, 2025

Update: first results

This test went well on the Linux part. The Windows label was blocked as we have not applied the "label" trick from #4490 which is currently in place on (azure) ci.jenkins.io.

Ran 2 builds, both successful on the ACP part (the 2nd build was faster thanks to faster dependency resolution) and on the Linux 21 build.

Pods are allocated immediately (only waiting for the node to start).

See https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin/job/master/1/ and https://aws.ci.jenkins.io/job/jenkins-infra-test-plugin/job/master/2/

Initial build https://aws.ci.jenkins.io/job/bom/job/master/2/ shows:

  • The initial prep stage took around 30 min (time to load the ACP cache for the first time)
  • Karpenter quickly found a set of available instances, mostly Spot instances, and packed the agents onto them successfully
  • Problem 1 (Node Disks): The parallel stages all failed due to "Disk Insufficient Storage" errors:
[Images: 3 screenshots]
  • Problem 2 (Subnets): the EKS Cluster Health Advisory complained about subnets being too small to handle the load:

[Image: screenshot]

  • NOTE: our monitoring did alert us about disk usage, both inodes and available space. Which is good: it means the datadog agent is working as expected!
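
As a side note on Problem 2, a quick sketch of the arithmetic behind "subnets too small". It assumes the default AWS VPC CNI behaviour where each pod takes an IP address from the subnet, and uses the fact that AWS reserves 5 addresses in every subnet. The CIDR blocks below are made-up examples, not the real subnets of the cluster.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr):
    """Number of addresses left for nodes and pods in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


# Hypothetical CIDR blocks, only to compare sizes.
for cidr in ("10.0.0.0/24", "10.0.0.0/20"):
    print(cidr, usable_addresses(cidr))
```

A /24 leaves 251 addresses while a /20 leaves 4091, so a burst of agent pods can exhaust a small subnet quickly.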

@dduportal
Contributor Author

  • Problem 1 (Node Disks): The parallel stages all failed due to "Disk Insufficient Storage" errors:

Update: the node disk should now be 300 GB since jenkins-infra/terraform-aws-sponsorship#111

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Jan 24, 2025
…fra reports (#6144)

Related to
jenkins-infra/helpdesk#4317 (comment),

This PR adds automatic tracking of the subnets and IP configurations for
the AWS ACP LB.
Now, any changes in the subnets settings of
https://github.com/jenkins-infra/terraform-aws-sponsorship will be
tracked by `updatecli` and a PR will be created to pass these changes.

Note: we had to delete the SVC for
jenkins-infra/terraform-aws-sponsorship#112 to
succeed
@dduportal
Contributor Author

dduportal commented Jan 24, 2025

  • Problem 2 (Subnets): the EKS Cluster Health Advisory complained about subnets being too small to handle the load:
