[WIP] Default to Azure Linux images #4832
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retitle [WIP] Default to Azure Linux images
This needs new unit tests and has at least one
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4832      +/-   ##
==========================================
- Coverage   62.19%   62.08%   -0.11%
==========================================
  Files         201      201
  Lines       16878    16910      +32
==========================================
+ Hits        10497    10499       +2
- Misses       5591     5621      +30
  Partials      790      790
defer done()

// First try Azure Linux, then Ubuntu.
defaultImage, err := s.GetDefaultAzureLinuxImage(ctx, location, k8sVersion)
This change can add a couple of API calls to this common code path if it has to fall back to Ubuntu. But they all ultimately call getSKUAndVersion, which implements a cache, so in practice it shouldn't cause many new round-trips.
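To illustrate why a cache keeps the extra lookups cheap, here is a minimal Go sketch of a memoized SKU lookup. It is not the actual getSKUAndVersion implementation: the skuKey and skuCache types, the fetch callback, and the example values are all hypothetical and only show the general idea that repeated lookups for the same key hit the API once.

```go
package main

import (
	"fmt"
	"sync"
)

// skuKey identifies one image lookup. Hypothetical type for illustration.
type skuKey struct {
	location, publisher, offer, k8sVersion string
}

// skuCache memoizes lookups so repeated calls for the same key do not
// trigger another API round-trip. fetch stands in for the real API call.
type skuCache struct {
	mu      sync.Mutex
	entries map[skuKey]string
	fetch   func(skuKey) (string, error)
}

func newSKUCache(fetch func(skuKey) (string, error)) *skuCache {
	return &skuCache{entries: make(map[skuKey]string), fetch: fetch}
}

// get returns the cached SKU for key, calling fetch only on a cache miss.
// Errors are not cached, so a failed lookup is retried on the next call.
func (c *skuCache) get(key skuKey) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if sku, ok := c.entries[key]; ok {
		return sku, nil
	}
	sku, err := c.fetch(key)
	if err != nil {
		return "", err
	}
	c.entries[key] = sku
	return sku, nil
}

func main() {
	calls := 0
	cache := newSKUCache(func(k skuKey) (string, error) {
		calls++ // counts simulated API round-trips
		return k.offer + "-" + k.k8sVersion, nil
	})
	// All values below are made up for the example.
	key := skuKey{
		location:   "westus2",
		publisher:  "example-publisher",
		offer:      "example-offer",
		k8sVersion: "1.29.4",
	}
	for i := 0; i < 3; i++ {
		if _, err := cache.get(key); err != nil {
			fmt.Println("lookup failed:", err)
			return
		}
	}
	fmt.Println("lookups: 3, simulated API calls:", calls)
}
```

In this sketch the lock is held across the fetch for simplicity; a real implementation might prefer to avoid blocking unrelated keys.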
8b5fb2d to 8fc1c8e
8fc1c8e to 1b5fbec
The current problem with this PR is that Calico never comes all the way up. The control-plane node and the first worker node have all their calico pods come up, but the calico-node pod on any subsequent worker node gets stuck. No logs are emitted, and kubectl logs fails with an i/o timeout:
% KUBECONFIG=k.conf kubectl logs -n calico-system calico-node-cwqkn
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Error from server: Get "https://10.1.0.4:10250/containerLogs/calico-system/calico-node-cwqkn/calico-node": dial tcp 10.1.0.4:10250: i/o timeout
% KUBECONFIG=k.conf kubectl describe pod -n calico-system calico-node-cwqkn
Name: calico-node-cwqkn
Namespace: calico-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: calico-node
Node: default-12994-md-0-6g6kz-sc4lz/10.1.0.4
Start Time: Mon, 03 Jun 2024 11:06:25 -0600
Labels: app.kubernetes.io/name=calico-node
controller-revision-hash=5c8fc7b67d
k8s-app=calico-node
pod-template-generation=1
Annotations: hash.operator.tigera.io/cni-config: 1d49cc679bcf7605c0da8c68a653470b79889bb3
hash.operator.tigera.io/system: bb4746872201725da2dea19756c475aa67d9c1e9
hash.operator.tigera.io/tigera-ca-private: 0e93a8ddcb650aeeaa893b4ce2186dfcd00d2c82
Status: Running
IP: 10.1.0.4
IPs:
IP: 10.1.0.4
Controlled By: DaemonSet/calico-node
Init Containers:
flexvol-driver:
Container ID: containerd://40d1e4414a38a2d1027024f52c1b652ff9444ac4ab6c67c9091708486f7106cc
Image: mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1
Image ID: mcr.microsoft.com/oss/calico/pod2daemon-flexvol@sha256:7e51c338e4201975ee34610c15ae5c303fbefe98b40528a2ff22758de376936d
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 03 Jun 2024 11:06:34 -0600
Finished: Mon, 03 Jun 2024 11:06:34 -0600
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
install-cni:
Container ID: containerd://9c12174c1c8d4f76caabf03be6a8814f3f0d0f67fe65306183a990067bf9fcca
Image: mcr.microsoft.com/oss/calico/cni:v3.26.1
Image ID: mcr.microsoft.com/oss/calico/cni@sha256:7eb740f75b78c3614ab31cc8dd8a40e270acb23c9ac6a82faa7d8427fbd2a35e
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
Command:
/opt/cni/bin/install
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 03 Jun 2024 11:07:34 -0600
Finished: Mon, 03 Jun 2024 11:07:38 -0600
Ready: True
Restart Count: 1
Environment:
CNI_CONF_NAME: 10-calico.conflist
SLEEP: false
CNI_NET_DIR: /etc/cni/net.d
CNI_NETWORK_CONFIG: <set to the key 'config' of config map 'cni-config'> Optional: false
KUBERNETES_SERVICE_HOST: 10.96.0.1
KUBERNETES_SERVICE_PORT: 443
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
Containers:
calico-node:
Container ID: containerd://1e30ff72f8aed8a3cd6b5d161b0e7ce1d8a0599257ac8418a2adca53fa004fa4
Image: mcr.microsoft.com/oss/calico/node:v3.26.1
Image ID: mcr.microsoft.com/oss/calico/node@sha256:e3cacb61880218016d18dda7c63801610face22fc0bd39bdedb9d975a7963b11
Port: <none>
Host Port: <none>
SeccompProfile: RuntimeDefault
State: Running
Started: Mon, 03 Jun 2024 11:07:46 -0600
Ready: False
Restart Count: 0
Liveness: http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/bin/calico-node -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
CLUSTER_TYPE: k8s,operator
CALICO_DISABLE_FILE_LOGGING: false
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_HEALTHENABLED: true
FELIX_HEALTHPORT: 9099
NODENAME: (v1:spec.nodeName)
NAMESPACE: calico-system (v1:metadata.namespace)
FELIX_TYPHAK8SNAMESPACE: calico-system
FELIX_TYPHAK8SSERVICENAME: calico-typha
FELIX_TYPHACAFILE: /etc/pki/tls/certs/tigera-ca-bundle.crt
FELIX_TYPHACERTFILE: /node-certs/tls.crt
FELIX_TYPHAKEYFILE: /node-certs/tls.key
FIPS_MODE_ENABLED: false
FELIX_TYPHACN: typha-server
CALICO_MANAGE_CNI: true
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_IPV4POOL_VXLAN: Always
CALICO_IPV4POOL_BLOCK_SIZE: 26
CALICO_IPV4POOL_NODE_SELECTOR: all()
CALICO_IPV4POOL_DISABLE_BGP_EXPORT: false
FELIX_VXLANMTU: 1350
FELIX_WIREGUARDMTU: 1350
CALICO_NETWORKING_BACKEND: vxlan
IP: autodetect
IP_AUTODETECTION_METHOD: first-found
IP6: none
FELIX_IPV6SUPPORT: false
KUBERNETES_SERVICE_HOST: 10.96.0.1
KUBERNETES_SERVICE_PORT: 443
Mounts:
/etc/pki/tls/cert.pem from tigera-ca-bundle (ro,path="ca-bundle.crt")
/etc/pki/tls/certs from tigera-ca-bundle (ro)
/host/etc/cni/net.d from cni-net-dir (rw)
/lib/modules from lib-modules (ro)
/node-certs from node-certs (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/log/calico/cni from cni-log-dir (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8fzhb (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
tigera-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: tigera-ca-bundle
Optional: false
node-certs:
Type: Secret (a volume populated by a Secret)
SecretName: node-certs
Optional: false
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
cni-log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/calico/cni
HostPathType:
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
kube-api-access-8fzhb:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/os=linux
Tolerations: :NoSchedule op=Exists
:NoExecute op=Exists
CriticalAddonsOnly op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m1s default-scheduler Successfully assigned calico-system/calico-node-cwqkn to default-12994-md-0-6g6kz-sc4lz
Normal Pulling 2m57s kubelet Pulling image "mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1"
Normal Pulled 2m51s kubelet Successfully pulled image "mcr.microsoft.com/oss/calico/pod2daemon-flexvol:v3.26.1" in 2.473s (5.613s including waiting)
Normal Created 2m51s kubelet Created container flexvol-driver
Normal Started 2m51s kubelet Started container flexvol-driver
Normal Pulling 2m47s kubelet Pulling image "mcr.microsoft.com/oss/calico/cni:v3.26.1"
Normal Pulled 2m36s kubelet Successfully pulled image "mcr.microsoft.com/oss/calico/cni:v3.26.1" in 10.698s (10.698s including waiting)
Normal Created 111s (x2 over 2m36s) kubelet Created container install-cni
Normal Started 111s (x2 over 2m36s) kubelet Started container install-cni
Normal Pulled 111s kubelet Container image "mcr.microsoft.com/oss/calico/cni:v3.26.1" already present on machine
Normal Pulling 107s kubelet Pulling image "mcr.microsoft.com/oss/calico/node:v3.26.1"
Normal Pulled 99s kubelet Successfully pulled image "mcr.microsoft.com/oss/calico/node:v3.26.1" in 7.375s (7.375s including waiting)
Normal Created 99s kubelet Created container calico-node
Normal Started 99s kubelet Started container calico-node
Warning Unhealthy 99s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:46.725894 24 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Warning Unhealthy 98s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:47.572418 45 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Warning Unhealthy 98s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
W0603 17:07:47.726779 56 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Warning Unhealthy 88s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
W0603 17:07:57.579934 153 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Warning Unhealthy 78s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
W0603 17:08:07.570650 165 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Warning Unhealthy 68s kubelet Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
All pods are healthy except that one, which never becomes ready:
% KUBECONFIG=k.conf kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-apiserver calico-apiserver-58f97bc954-cdxrw 1/1 Running 0 4m37s
calico-apiserver calico-apiserver-58f97bc954-v4g7x 1/1 Running 0 4m37s
calico-system calico-kube-controllers-5696b6f5cd-8pnkv 1/1 Running 0 5m32s
calico-system calico-node-82ktb 1/1 Running 0 5m32s
calico-system calico-node-cwqkn 0/1 Running 0 4m21s
calico-system calico-node-t2bl7 1/1 Running 0 3m56s
calico-system calico-typha-5768f775d4-797td 1/1 Running 1 (3m11s ago) 3m53s
calico-system calico-typha-5768f775d4-nhxpv 1/1 Running 0 5m32s
calico-system csi-node-driver-qn4b8 2/2 Running 0 3m56s
calico-system csi-node-driver-tvjrl 2/2 Running 0 5m32s
calico-system csi-node-driver-wx87n 2/2 Running 0 4m21s
kube-system cloud-controller-manager-85f4c7cd6-5k5z7 1/1 Running 0 6m18s
kube-system cloud-node-manager-kwrb2 1/1 Running 0 3m56s
kube-system cloud-node-manager-sf5rw 1/1 Running 0 6m18s
kube-system cloud-node-manager-wvqc9 1/1 Running 0 4m21s
kube-system coredns-5dd5756b68-4ckr2 1/1 Running 0 6m21s
kube-system coredns-5dd5756b68-scg4j 1/1 Running 0 6m21s
kube-system etcd-default-12994-control-plane-gwlg6 1/1 Running 0 6m21s
kube-system kube-apiserver-default-12994-control-plane-gwlg6 1/1 Running 0 6m21s
kube-system kube-controller-manager-default-12994-control-plane-gwlg6 1/1 Running 0 6m21s
kube-system kube-proxy-q4gq9 1/1 Running 0 3m56s
kube-system kube-proxy-xbtpd 1/1 Running 0 6m21s
kube-system kube-proxy-zrq9f 1/1 Running 0 4m21s
kube-system kube-scheduler-default-12994-control-plane-gwlg6 1/1 Running 0 6m21s
tigera-operator tigera-operator-776f4dcbf5-rcrbn 1/1 Running 1 (5m36s ago) 6m10s
% KUBECONFIG=k.conf kubectl logs -n calico-system calico-node-cwqkn
Defaulted container "calico-node" out of: calico-node, flexvol-driver (init), install-cni (init)
Error from server: Get "https://10.1.0.4:10250/containerLogs/calico-system/calico-node-cwqkn/calico-node": dial tcp 10.1.0.4:10250: i/o timeout
1b5fbec to 7a18d1f
@mboersma: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
PR needs rebase.
/close
See #5223 instead. I don't think we want to make Azure Linux the default, at least not yet. It would break a lot of users' assumptions in
@mboersma: Closed this PR. In response to this:
What type of PR is this?
/kind feature
What this PR does / why we need it:
Changes the default node image selection logic to prefer Azure Linux (aka Mariner) images, falling back to Ubuntu if no AL image is found for the Kubernetes version required.
This should speed up provisioning a bit, as well as align CAPZ better with Azure service recommendations and security initiatives.
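
The selection order described here (Azure Linux first, Ubuntu as a fallback) can be sketched in Go roughly as follows. This is an illustration under assumptions, not the PR's actual code: imageLookup, firstAvailableImage, errNoImage, and the image name strings are hypothetical stand-ins for the real per-distro lookups.

```go
package main

import (
	"errors"
	"fmt"
)

// imageLookup is a hypothetical stand-in for a per-distro lookup; it
// returns an image reference or an error for a given Kubernetes version.
type imageLookup func(k8sVersion string) (string, error)

var errNoImage = errors.New("no image for this Kubernetes version")

// firstAvailableImage tries each lookup in order and returns the first
// image found. If every lookup fails, it returns all errors joined.
func firstAvailableImage(k8sVersion string, lookups ...imageLookup) (string, error) {
	var errs []error
	for _, lookup := range lookups {
		image, err := lookup(k8sVersion)
		if err == nil {
			return image, nil
		}
		errs = append(errs, err)
	}
	if len(errs) == 0 {
		// No lookups were supplied at all.
		return "", errNoImage
	}
	return "", errors.Join(errs...)
}

func main() {
	// Pretend Azure Linux only publishes an image for one version.
	azureLinux := func(v string) (string, error) {
		if v == "1.29.4" {
			return "azurelinux-" + v, nil
		}
		return "", fmt.Errorf("azure linux %s: %w", v, errNoImage)
	}
	ubuntu := func(v string) (string, error) {
		return "ubuntu-" + v, nil
	}

	for _, v := range []string{"1.29.4", "1.27.0"} {
		image, err := firstAvailableImage(v, azureLinux, ubuntu)
		if err != nil {
			fmt.Println(v, "error:", err)
			continue
		}
		fmt.Println(v, "->", image)
	}
}
```

Passing the lookups as an ordered list keeps the preference explicit, so changing the default distro would only mean reordering the arguments.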
Which issue(s) this PR fixes:
Fixes #4828
See also kubernetes-sigs/image-builder#1465
Special notes for your reviewer:
TODOs:
Release note: