Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GA support for Windows hostprocess containers #7857

Merged
merged 16 commits into from
Sep 15, 2023

Conversation

coutinhop
Copy link
Contributor

@coutinhop coutinhop commented Jul 11, 2023

Build on the changes from #7260:

  • Add a Windows cni-plugin hostprocess image, which will install the cni binaries and config on the host.
  • Add a Windows node hostprocess image, which no longer needs to match the hosts' Windows version

Manually tested with an operator image from tigera/operator#2732 and a VXLAN setup.

Many thanks to @jsturtevant and @fabi200123

Description

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Add support for Windows host process containers (HPC), with Calico for Windows daemonsets managed by kubernetes.

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@coutinhop coutinhop added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Jul 11, 2023
@coutinhop coutinhop added this to the Calico v3.27.0 milestone Jul 11, 2023
@coutinhop coutinhop requested a review from a team as a code owner July 11, 2023 22:56
@coutinhop coutinhop self-assigned this Jul 11, 2023
coutinhop added a commit to coutinhop/operator that referenced this pull request Jul 11, 2023
- Add windows calico-node daemonset with hostprocess containers.
- Add windows kube-proxy daemonset using sig-windows hostprocess
  container image.
- Add optional (required only for windows) fields KUBERNETES_SERVICE_CIDR
  and KUBERNETES_DNS_SERVERS to kubernetes-services-endpoint configmap.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857. Still a work
in progress.

Pending:
- Tests for windows calico-node rendering (pkg/render/windows_test.go)
- Possibly make CNI BIN and CONF dirs configurable for Windows in the
  installation
- Possibly roll out our own kube-proxy image instead of using the one
  from sig-windows
- Possibly adding
coutinhop added a commit to coutinhop/operator that referenced this pull request Jul 11, 2023
- Add windows calico-node daemonset with hostprocess containers.
- Add windows kube-proxy daemonset using sig-windows hostprocess
  container image.
- Add optional (required only for windows) fields KUBERNETES_SERVICE_CIDR
  and KUBERNETES_DNS_SERVERS to kubernetes-services-endpoint configmap.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857. Still a work
in progress.

Pending:
- Tests for windows calico-node rendering (pkg/render/windows_test.go)
- Possibly make CNI BIN and CONF dirs configurable for Windows in the
  installation
- Possibly roll out our own kube-proxy image instead of using the one
  from sig-windows
- Possibly adding an initContainer to clean up "old" style non-HPC
  Calico installations before proceeding
coutinhop added a commit to coutinhop/operator that referenced this pull request Jul 11, 2023
- Add windows calico-node daemonset with hostprocess containers.
- Add windows kube-proxy daemonset using sig-windows hostprocess
  container image.
- Add optional (required only for windows) fields KUBERNETES_SERVICE_CIDR
  and KUBERNETES_DNS_SERVERS to kubernetes-services-endpoint configmap.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857. Still a work
in progress.

Pending:
- Tests for windows calico-node rendering (pkg/render/windows_test.go)
- Possibly make CNI BIN and CONF dirs configurable for Windows in the
  installation
- Possibly roll out our own kube-proxy image instead of using the one
  from sig-windows
- Possibly adding an initContainer to clean up "old" style non-HPC
  Calico installations before proceeding
// In the case of Windows HostProcess containers this prepends the CONTAINER_SANDBOX_MOUNT_POINT env variable
// for other operating systems or if the sandbox env variable is not set it returns the standard mount points
// see https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#volume-mounts
func getHostPath(path string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you only support Containerd 1.7+ this shouldn't be needed and would clean up this section of code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so then just referring to the paths as /host/foo would work on Windows without any need for change? Thanks, I hadn't tried that

Copy link
Contributor

@jsturtevant jsturtevant Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, if they are volumemounts to the /host/ location that should work. This is why the In-Cluster config now works 🚀

Copy link
Contributor Author

@coutinhop coutinhop Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsturtevant I got rid of this getHostPath() function on my latest commit, though I still needed to use the $CONTAINER_SANDBOX_MOUNT_POINT env var when referring to the binaries inside the container, which I thought wouldn't be needed. Will tag you on the relevant chunk of code...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick update: I ended up implementing support for containerd v1.6, so we're using $CONTAINER_SANDBOX_MOUNT_POINT just about everywhere that's necessary. Everything's working on my tests so far, and that way we can support a containerd version that's widely adopted right now...

Comment on lines +200 to +204
containerBinDir := "/opt/cni/bin/"
// The binaries dir in the container needs to be prepended by the CONTAINER_SANDBOX_MOUNT_POINT env var on Windows Host Process Containers
// see https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#containerd-v1-7-and-greater
if runtime.GOOS == "windows" {
containerBinDir = os.Getenv("CONTAINER_SANDBOX_MOUNT_POINT") + "/" + containerBinDir
Copy link
Contributor Author

@coutinhop coutinhop Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsturtevant I still needed to use the $CONTAINER_SANDBOX_MOUNT_POINT env var when referring to the binaries inside the container here, which from what I read on https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#containerd-v1-7-and-greater shouldn't be needed, right? Could you please take a look and let me know if I missed something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does the yaml look like for your deployment? I don't see it in the PR here. Could it be that the Working directory is set to something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've mostly been working with the tigera operator with the changes from this PR: tigera/operator#2732, but the deployment also works with this manifest: https://gist.github.com/coutinhop/7528503be5d0aeb22001d69d82fa61ed

No working dir on the install-cni initContainer, but I did include $env:CONTAINER_SANDBOX_MOUNT_POINT in the path for command in it, could that cause issues?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No working dir on the install-cni initContainer, but I did include $env:CONTAINER_SANDBOX_MOUNT_POINT in the path for command in it, could that cause issues?

yes I think that would do it. could you try with something like command: ["opt/cni/bin/install.exe"] or /opt/cni/bin/install.exe depending on where those files live?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've tried it (removing mentions to $env:CONTAINER_SANDBOX_MOUNT_POINT), but I got

'/opt/cni/bin/install.exe' is not recognized as an internal or external command

I see that the yaml for kube-proxy also still uses $env:CONTAINER_SANDBOX_MOUNT_POINT
https://github.com/kubernetes-sigs/sig-windows-tools/blob/3ed28295f8694dffd03e144af4423137b2b8c30b/hostprocess/calico/kube-proxy/kube-proxy.yml#L25

Apparently the binaries (and scripts, I couldn't manage to make the uninstall-calico initContainer work either) are not in these absolute paths, I'm guessing some path in c:/hpc/... would work, but then why not just use $env:CONTAINER_SANDBOX_MOUNT_POINT, right?

This is by far not a deal breaker, stuff was working so I wouldn't block anything by it (unless you see $env:CONTAINER_SANDBOX_MOUNT_POINT being dropped at some point and these containers as they are becoming unsupported).

One other quick question: is the presence of $env:CONTAINER_SANDBOX_MOUNT_POINT a reliable indicator that we're inside a host process container? We've used it in a couple of places, I wonder if we should look into changing that...

Once again, thanks so much!

coutinhop added a commit to coutinhop/operator that referenced this pull request Jul 27, 2023
- Add windows calico-node daemonset with hostprocess containers.
- Add windows kube-proxy daemonset using sig-windows hostprocess
  container image.
- Add optional (required only for windows) fields KUBERNETES_SERVICE_CIDR
  and KUBERNETES_DNS_SERVERS to kubernetes-services-endpoint configmap.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857. Still a work
in progress.

Pending:
- Tests for windows calico-node rendering (pkg/render/windows_test.go)
- Possibly make CNI BIN and CONF dirs configurable for Windows in the
  installation
- Possibly roll out our own kube-proxy image instead of using the one
  from sig-windows
- Possibly adding an initContainer to clean up "old" style non-HPC
  Calico installations before proceeding
Build on the changes from projectcalico#7260:
- Add a Windows cni-plugin hostprocess image, which will install the cni
  binaries and config on the host.
- Add a Windows node hostprocess image, which no longer needs to match
  the hosts' Windows version (e.g. only 1809 needs to be used and is
  supported on 2022).
base for windows HPC images.

Add uninstall-calico powershell script to be run in initContainer, which
will clean up non-HPC Calico installations from the host before starting
Calico Windows HPC.

Fix bug in confd-service.ps1
before substitution (this helps with operator testing).
…h containerd

v1.6+ (instead of requiring v1.7+).
…d with '$env:CONTAINER_SANDBOX_MOUNT_POINT').

Fix logic regarding CNI paths in uninstall-calico-hpc.ps1.
coutinhop added a commit to coutinhop/operator that referenced this pull request Aug 23, 2023
- Add windows calico-node daemonset with hostprocess containers.
- Add windows kube-proxy daemonset using sig-windows hostprocess
  container image.
- Add optional (required only for windows) fields KUBERNETES_SERVICE_CIDR
  and KUBERNETES_DNS_SERVERS to kubernetes-services-endpoint configmap.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857. Still a work
in progress.

Pending:
- Tests for windows calico-node rendering (pkg/render/windows_test.go)
- Possibly make CNI BIN and CONF dirs configurable for Windows in the
  installation
- Possibly roll out our own kube-proxy image instead of using the one
  from sig-windows
- Possibly adding an initContainer to clean up "old" style non-HPC
  Calico installations before proceeding
…ags()

to winutils.go and use them instead of upstream's InClusterConfig(),
BuildConfigFromFlags() in order to get Windows HPC to work with containerd v1.6.
(tests intentionally left untouched in order to try to catch
discrepancies in behavior from upstream if/when they happen).

Fix PATH in Windows dockerfiles.
@coutinhop coutinhop changed the title [WIP] Add GA support for Windows hostprocess Add GA support for Windows hostprocess Sep 6, 2023
@coutinhop coutinhop changed the title Add GA support for Windows hostprocess Add GA support for Windows hostprocess containers Sep 7, 2023
Use winutils.GetHostPath() on file paths in felix param_types.go.
Add resolving of $env:CONTAINER_SANDBOX_MOUNT_POINT to winutils.GetHostPath().
- clarify HPC env var comments
- add logging when skipping chmod on windows
- fix kubeconfigPath prefix issue
Copy link
Contributor

@mgleung mgleung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@coutinhop coutinhop merged commit 7b9f6ce into projectcalico:master Sep 15, 2023
2 checks passed
@coutinhop coutinhop deleted the pedro-win-hpc-1 branch September 15, 2023 22:16
coutinhop added a commit to tigera/operator that referenced this pull request Sep 18, 2023
* Remove calico windows upgrader code

* Add support for Windows hostprocess containers

- Add windows calico-node daemonset with hostprocess containers.

Manually tested with a clean installation and VXLAN setup and images
from projectcalico/calico#7857.

* Add uninstall-calico initContainer to the calico windows daemonset

* Make CNI conf filename configurable for Windows

* Don't render windows daemonset if disabled in installation resource.

* Fix windows HPC support checking and add tests for it

* Make windows HPC requirements be containerd v1.6 (instead of v1.7) and fix file paths to make them compatible with v1.6

* Add nodes/status permissions to cni-plugin clusterrole. Add externalnetworks plugin to node clusterrole when on EE.

Fix CNI paths and template in order for HPC to work with containerd v1.6.

* Move windows daemonset rendering from core-controller into its own windows-controller.

Remove ServiceCIDRs and DNSServers from k8s endpoint configmap. Move ServiceCIDRs to InstallationSpec and auto-detect dns servers from the cluster in windows-controller.

Add getWindowsBackend() function to windows rendering, to deduplicate this calculation.

Remove containerd v1.7+ requirement for windows daemonset (HPC) as it is no longer necessary.

Fix mixups of calls to cniDirectories() in windows rendering.

* Add old calico-windows-upgrade clean up and make default WindowsDataplane disabled on every provider except AKS.

* Fix openshift and RKE2 dns service values on windows_controller

* Remove finalizer logic from windows controller and rendering.

* Further add windows rendering UTs and fix some edge cases.

* Further add windows render UTs

* Address review comments:
Fill in windows default cni paths in installation resource
Remove 'terminating' logic from windows objects
Misc name changes and fixes

* Further address review comments:
- Only populate Installation.WindowsNodes when windows is enabled
- Use same serviceaccount/clusterroles/bindings for windows and linux calico-node
- Remove unused env vars from calico-node-windows containers
- Test changes to accomodate code changes

* Fix WindowsNodeSpec defaults

* Fix path env vars

* - Further address review comments
- Move validation out of windows_controller
- Remove unnecessary env vars from windows node containers
- Test adjustments
coutinhop added a commit to coutinhop/calico that referenced this pull request Jul 16, 2024
The windows upgrade code was cleaned up in projectcalico#7857, but there were leftover calls to it in node-service.ps1, leading to errors:

flag provided but not defined: -should-install-windows-upgrade

In addition to that, the token refresher, when run in a non-HPC scenario, would fail (on top of being unnecessary in non-HPC since that already requires the kubeconfig to be copied into the windows host):

2024-07-16 20:38:31.133 [FATAL][5064] cni-config-monitor/token_watch.go 59: Failed to read service account namespace file error=open /var/run/secrets/kubernetes.io/serviceaccount/namespace: The system cannot find the path specified.

These were masked by the fact that logs are complex to get in non-HPC windows (requires RDP access into the host machine, so e2e failures retrieve nothing), and kubectl get nodes would show the windows nodes as 'Ready' even though these errors were occurring.
coutinhop added a commit to coutinhop/calico that referenced this pull request Jul 18, 2024
The windows upgrade code was cleaned up in projectcalico#7857, but there were leftover calls to it in node-service.ps1, leading to errors:

flag provided but not defined: -should-install-windows-upgrade

In addition to that, the token refresher, when run in a non-HPC scenario, would fail (on top of being unnecessary in non-HPC since that already requires the kubeconfig to be copied into the windows host):

2024-07-16 20:38:31.133 [FATAL][5064] cni-config-monitor/token_watch.go 59: Failed to read service account namespace file error=open /var/run/secrets/kubernetes.io/serviceaccount/namespace: The system cannot find the path specified.

These were masked by the fact that logs are complex to get in non-HPC windows (requires RDP access into the host machine, so e2e failures retrieve nothing), and kubectl get nodes would show the windows nodes as 'Ready' even though these errors were occurring.
coutinhop added a commit to coutinhop/calico that referenced this pull request Jul 18, 2024
The windows upgrade code was cleaned up in projectcalico#7857, but there were leftover calls to it in node-service.ps1, leading to errors:

flag provided but not defined: -should-install-windows-upgrade

In addition to that, the token refresher, when run in a non-HPC scenario, would fail (on top of being unnecessary in non-HPC since that already requires the kubeconfig to be copied into the windows host):

2024-07-16 20:38:31.133 [FATAL][5064] cni-config-monitor/token_watch.go 59: Failed to read service account namespace file error=open /var/run/secrets/kubernetes.io/serviceaccount/namespace: The system cannot find the path specified.

These were masked by the fact that logs are complex to get in non-HPC windows (requires RDP access into the host machine, so e2e failures retrieve nothing), and kubectl get nodes would show the windows nodes as 'Ready' even though these errors were occurring.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-pr-required Change is not yet documented release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants