[ci.jenkins.io] Create private EKS cluster with "side" services (datadog, ACP, etc.) #4319

Open
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 22 comments


@dduportal
Contributor

dduportal commented Sep 28, 2024

We need a private EKS cluster to run ci.jenkins.io container agents.

@dduportal dduportal changed the title: Move "side" services to AWS => [ci.jenkins.io] Create private EKS cluster and Move "side" services to AWS Sep 28, 2024
@dduportal dduportal changed the title: [ci.jenkins.io] Create private EKS cluster and Move "side" services to AWS => [ci.jenkins.io] Create private EKS cluster with "side" services (datadog, ACP, etc.) Sep 28, 2024
@dduportal dduportal added this to the infra-team-sync-2024-10-01 milestone Sep 28, 2024
@dduportal dduportal removed this from the infra-team-sync-2024-10-15 milestone Oct 14, 2024
@dduportal dduportal added this to the infra-team-sync-2024-10-29 milestone Oct 15, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Oct 15, 2024
@dduportal
Contributor Author

Discussed with @smerle33:

@smerle33
Contributor

The module's usage has changed since the last time we used it: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/UPGRADE-20.0.md

@smerle33
Contributor

smerle33 commented Nov 4, 2024

We chose to handle all IAM usage within the private repository (https://github.com/jenkins-infra/terraform-states/commit/cfd08c45dd4153d676c9223670f927d515585679)
instead of giving the module user too much power.

@dduportal
Contributor Author

Then we will add datadog, which needs the docker registry secrets

As per https://github.com/jenkins-infra/kubernetes-management/pull/6020/files#r1890521384, we'll start with datadog (changed since yesterday)

@dduportal
Contributor Author

Update:

=> cluster still has 1 node but it is up and running

Next steps:

  • Add a new "applications" node group
    • Need to choose the correct sizing
    • Drain the current "tiny linux" and remove it to use applications fully instead
  • Set up cluster-autoscaler to have anti-affinity
    • Will "break" the deployment (only 1 replica running on the unique node)
    • Good test of its HA mode: it should keep working, trigger a scale-up, and auto-heal. Fall back to a manual scale-up if it breaks
  • Set up CoreDNS to have anti-affinity
  • Export node labels and taints in the JSON export
  • Add datadog Helm release
    • Involves retrieving node labels and taints from the Terraform JSON export to specify nodeSelectors and tolerations
  • Add EBS addon to support ACP
  • Add ACP Helm release
    • Involves retrieving node labels and taints from the Terraform JSON export to specify nodeSelectors and tolerations
    • Decide which volume provisioning pattern to use, as ACP is a StatefulSet:
      • dynamic (PV/PVC in the helm chart) or static (Terraform defined + JSON export)
      • Topology awareness (availability zone constraint)
      • Might need to update the "applications" node group to spread across the 2 subnets (distinct AZs) provided to EKS and set up 1 replica per AZ
  • Track missing elements with updatecli (search for TODO in the Terraform project)
  • Add the 2 new node groups for agents and bom-agents + export their labels/taints
  • Set up the rest of the helm charts
  • Optional: can we use instance identity to run the cluster auth (like we do for ec2) instead of creating a svc account?
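
As an illustration of the "retrieve node labels and taints from the Terraform JSON export" steps above, here is a minimal Python sketch. The shape of the export (the `node_groups` key and its `labels` and `taints` fields), the label and taint values, and the function name are all assumptions made for this example, not the actual content of the report. Only the idea of mapping labels to a `nodeSelector` and taints to `tolerations` comes from the task list.

```python
import json

# Hypothetical excerpt of the Terraform JSON export. The real report may be
# structured differently: every key and value below is an assumption.
EXPORT = json.loads("""
{
  "node_groups": {
    "applications": {
      "labels": {"role": "applications"},
      "taints": [
        {"key": "example.io/applications", "value": "true", "effect": "NO_SCHEDULE"}
      ]
    }
  }
}
""")

# EKS managed node groups spell taint effects in upper snake case, while
# Kubernetes tolerations expect the camel case form.
EKS_TO_K8S_EFFECT = {
    "NO_SCHEDULE": "NoSchedule",
    "NO_EXECUTE": "NoExecute",
    "PREFER_NO_SCHEDULE": "PreferNoSchedule",
}


def scheduling_values(export, node_group):
    """Build the nodeSelector and tolerations targeting one node group."""
    group = export["node_groups"][node_group]
    tolerations = [
        {
            "key": taint["key"],
            "operator": "Equal",
            "value": taint["value"],
            # Effects already in the Kubernetes form are passed through as-is.
            "effect": EKS_TO_K8S_EFFECT.get(taint["effect"], taint["effect"]),
        }
        for taint in group.get("taints", [])
    ]
    return {
        "nodeSelector": dict(group.get("labels", {})),
        "tolerations": tolerations,
    }


if __name__ == "__main__":
    # The printed structure could be merged into a Helm values file.
    print(json.dumps(scheduling_values(EXPORT, "applications"), indent=2))
```

Deriving both values from a single export entry would keep the Helm releases aligned with whatever Terraform defines for the node group, instead of duplicating labels and taints by hand.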

@dduportal
Contributor Author

Next steps:

  • Add a new "applications" node group
    • Need to choose the correct sizing
    • Drain the current "tiny linux" and remove it to use applications fully instead
      ...
  • Export node labels and taints in the JSON export

jenkins-infra/terraform-aws-sponsorship#71

@dduportal
Contributor Author

Set up cluster-autoscaler to have anti-affinity
...
Set up CoreDNS to have anti-affinity

As per the cluster-autoscaler and coredns recommendations, we should not do this, as it may constrain the cluster during upgrades. We should let the scheduler do its job instead (in EKS, as in AKS, it relaxes constraints when possible).

@dduportal
Contributor Author

Export node labels and taints in the JSON export

https://reports.jenkins.io/jenkins-infra-data-reports/aws-sponsorship.json => LGTM

@dduportal
Contributor Author

dduportal commented Dec 23, 2024

Update: had to re-create the cluster to ensure a successful bootstrap. Many node creation attempts ended up in NotReady, due to several factors:

  • Network ACLs were blocking some requests to the Amazon ECRs hosting some of their add-ons (coredns) and the cluster-autoscaler image
  • Adding tolerations to the kube-proxy and CNI add-ons messed up their configuration (most probably because of the add-on "preserve/overwrite" system, which I misunderstood).
    • But it is REQUIRED for CoreDNS addon and cluster-autoscaler...

Related code changes:

dduportal added a commit to jenkins-infra/kubernetes-management that referenced this issue Dec 23, 2024
…as unique release (#6020)

as per
jenkins-infra/helpdesk#4319 (comment)

starting adding the new EKS cluster to infra.ci kubernetes-management

kubeconfig added as secrets here
jenkins-infra/charts-secrets@a24b1ec
and datadog api key here
jenkins-infra/charts-secrets@c7505e8

need #6021

⚠️ BEFORE merging this PR we need to create the `datadog` namespace
using : 

```
kubectl config use-context arn:aws:eks:us-east-2:326712726440:cluster/cijenkinsio-agents-2
kubectl create ns datadog
```
 

splitting in multiple PR:

this one is with the minimum release possible, so only datadog as a
start
@dduportal
Contributor Author

dduportal commented Dec 23, 2024

Annnnd datadog is installed: jenkins-infra/kubernetes-management#6020 Merry Christmas!

@dduportal dduportal self-assigned this Jan 3, 2025
@dduportal
Contributor Author

dduportal commented Jan 3, 2025

Update: starting work for installing ACP.

First set of working hypotheses for the initial deployment: Internal SVC and standard (gp3) EBS persistence. The goal is to have an initial working deployment which can be used from inside the EKS cluster (e.g. by container agents).

Second set of hypotheses: Use a "private LB". The goal is to allow EC2 VM agents to access ACP without opening it publicly.

We'll have to monitor ACP metrics once we start using it (mainly CPU and EBS IOPS) to see whether it needs more resources.

@dduportal
Copy link
Contributor Author

dduportal commented Jan 3, 2025

First set of working hypotheses for the initial deployment: Internal SVC and EBS persistence

Task list:

  • Enable CSI add-on (ref. https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html)
    • Requires setting up properties
  • Add EBS "premium" (start with gp3) storage classes for each zone defined for the EKS cluster
  • Set up initial ACP installation (with proper setup)
    • Tolerations and taints should not need to change
    • Double check the DNS resolver hostname

@dduportal
Contributor Author

First set of working hypotheses for the initial deployment: Internal SVC and EBS persistence

ACP is now installed (jenkins-infra/terraform-aws-sponsorship#74, jenkins-infra/terraform-aws-sponsorship#75, jenkins-infra/kubernetes-management#6073)

@dduportal
Contributor Author

dduportal commented Jan 6, 2025

Next steps (all elements have the same priority):

@smerle33
Contributor

smerle33 commented Jan 6, 2025

For part 1 (jenkins namespace, service account, RBAC, and the IAM link between the VM IAM identity and the Kubernetes service account):

The namespace, service account and RBAC are handled by the Helm chart: https://github.com/jenkins-infra/helm-charts/tree/main/charts/jenkins-kubernetes-agents
