What happened:
We require all network-operator pods to run on nodes that have taints. We added the toleration in the operator spec as well as in the Tolerations section for the DaemonSets, but these tolerations are not propagated to the nv-ipam controller, and its pod remains in the Pending state.
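To confirm the failure mode, the pending pod's scheduling events can be inspected. A minimal sketch; the pod name is a placeholder and the grep filter is an assumption about the pod's naming:

```sh
# Find the pending controller pod (exact names/labels vary by deployment):
kubectl -n nvidia-network-operator get pods | grep -i ipam

# Inspect its scheduling events; an untolerated taint typically surfaces as
# "0/N nodes are available: N node(s) had untolerated taint ...".
kubectl -n nvidia-network-operator describe pod <nv-ipam-controller-pod>
```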
What you expected to happen:
Tolerations should propagate to the nv-ipam controller as well.
How to reproduce it (as minimally and precisely as possible):
Deploy the network operator on a cluster where all nodes are tainted, and add a toleration in the Helm chart (see the sketch below). If the nv-ipam controller deployment is enabled, its pods will remain in the Pending state.
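For illustration, a minimal sketch of the kind of toleration we set; the key path and the taint key are assumptions, not confirmed against the chart's values schema:

```yaml
# Hypothetical values.yaml excerpt -- the key path and taint key are
# illustrative only, not confirmed against the network-operator chart.
operator:
  tolerations:
    - key: "example.com/dedicated"   # placeholder taint key
      operator: "Exists"
      effect: "NoSchedule"
```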
Anything else we need to know?:
Logs:
NicClusterPolicy CR spec and state:
Output of: kubectl -n nvidia-network-operator get -A:
Network Operator version:
Logs of Network Operator controller:
Logs of the various Pods in nvidia-network-operator namespace:
Helm Configuration (if applicable):
Kubernetes nodes' information (labels, annotations, and status): kubectl get node -o yaml:
Environment:
Kubernetes version (use kubectl version): N/A
Hardware configuration: N/A
Network adapter model and firmware version:
OS (e.g: cat /etc/os-release): N/A
Kernel (e.g. uname -a): N/A
Others: N/A
This setting does not change tolerations for controllers deployed by the network-operator, such as nv-ipam-controller and ib-kubernetes.
These controllers have a fixed, built-in set of tolerations:
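A minimal sketch of what such built-in tolerations typically look like; the exact keys below are an assumption, not the controllers' confirmed list:

```yaml
# Assumed example -- common control-plane defaults, not confirmed from
# the operator's rendered manifests.
tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "node-role.kubernetes.io/control-plane"
    operator: "Exists"
    effect: "NoSchedule"
```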
I agree that adding support for custom tolerations and nodeAffinity for controllers could be beneficial. This feature will likely be implemented as a separate setting.
The issue has been converted to an enhancement, as the controller currently functions as expected.
@ykulazhenkov Thanks! Due to security constraints, there are situations where all nodes in a cluster are required to have taints.
Having one deployment be the odd one out among all the pods run by the network operator, with no way to add tolerations, felt like a bug to me: an omission rather than a deliberate design choice.
Hopefully this can be fixed soon so that we can use the network operator without any workarounds.
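Until then, one possible interim workaround is to patch the deployment directly. A sketch only: the taint key is a placeholder, it assumes the tolerations list already exists on the deployment, and the operator may reconcile the change away:

```sh
# Append a toleration to the controller deployment (JSON patch "add" to the
# end of the list; fails if spec.template.spec.tolerations does not exist).
# The operator may revert this on its next reconciliation.
kubectl -n nvidia-network-operator patch deployment nv-ipam-controller \
  --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/tolerations/-","value":{"key":"example.com/dedicated","operator":"Exists","effect":"NoSchedule"}}]'
```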