Massive memory usage using Docker Swarm with Windows Server 22 #337

Closed
andyfisher100 opened this issue Mar 17, 2023 · 14 comments
Labels: bug (Something isn't working)

@andyfisher100

Describe the bug
We have been using Docker Swarm for our Windows containers for a number of years. We recently upgraded our host nodes to Windows Server 2022.

Since the upgrade we have seen a huge increase in memory consumption: over a couple of weeks a node climbs to 98% memory used and becomes unusable. Our servers have between 64 GB and 128 GB of memory.

Over a 1-2 week period the non-paged pool grows gradually until memory is full. This behaviour usually suggests a memory leak of some kind.

In the end we just have to reboot the host, which is not ideal in a production system.

I believe it may be linked to the way some of our containers run, and that Swarm is trying to keep track of containers that are no longer running. We use our containers as build agents for Azure DevOps. Some of the containers are configured to "Run Once". In this setup, an Azure DevOps pipeline job runs and, when it finishes, the agent process exits and the container exits with it. Swarm then sees that there is one replica fewer and spins up a new, clean build agent.

I don't know whether the OS is somehow still reserving memory for containers that have died.
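
For anyone wanting to watch for the same symptom, the non-paged pool growth described above can be sampled from the `\Memory\Pool Nonpaged Bytes` performance counter. The following is a minimal sketch, not something taken from the reporter's setup: it assumes the built-in `typeperf` tool is on the PATH and that its output is the usual two-column CSV (timestamp, value). The function names are hypothetical.

```python
import csv
import io
import subprocess

# Standard Windows performance counter for the non-paged pool size.
COUNTER = r"\Memory\Pool Nonpaged Bytes"


def parse_typeperf(text):
    """Return the numeric samples (bytes) found in typeperf CSV output."""
    samples = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue  # blank lines and single-field status lines
        try:
            samples.append(float(row[1]))
        except ValueError:
            continue  # header row or a status line that contains a comma
    return samples


def read_nonpaged_pool_bytes():
    """Take one sample of the counter (Windows only, needs typeperf)."""
    result = subprocess.run(
        ["typeperf", COUNTER, "-sc", "1"],  # -sc 1: collect a single sample
        capture_output=True, text=True, check=True,
    )
    samples = parse_typeperf(result.stdout)
    if not samples:
        raise RuntimeError("no counter sample found in typeperf output")
    return samples[-1]
```

Calling `read_nonpaged_pool_bytes()` on a schedule and logging the result would show whether the pool keeps climbing between reboots.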

To Reproduce

  1. Set up a Docker Swarm cluster on Windows Server 2022 hosts
  2. Run a Swarm service whose containers stop themselves
  3. Non-paged pool memory usage gradually increases
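
The steps above could be approximated with a service whose tasks exit on their own and are then replaced by Swarm. This is only a sketch of such a setup, not the reporter's actual build-agent configuration: the service name, replica count, and the `Start-Sleep` stand-in workload are all assumptions, and only the base image comes from this issue.

```python
import subprocess

# Base image taken from the Configuration section of this issue.
IMAGE = "mcr.microsoft.com/dotnet/framework/runtime:4.8-windowsservercore-ltsc2022"


def build_service_command(name, replicas, lifetime_seconds):
    """Build a `docker service create` command whose tasks exit on their own."""
    return [
        "docker", "service", "create",
        "--name", name,
        "--replicas", str(replicas),
        # Restart a task whenever it exits, including a clean exit,
        # which mimics Swarm replacing a finished "Run Once" agent.
        "--restart-condition", "any",
        IMAGE,
        # Stand-in workload: sleep for a while, then exit.
        "powershell", "-Command", f"Start-Sleep -Seconds {lifetime_seconds}",
    ]


def create_service(name="leak-repro", replicas=4, lifetime_seconds=60):
    """Create the service on a Swarm manager node (requires the Docker CLI)."""
    subprocess.run(build_service_command(name, replicas, lifetime_seconds), check=True)
```

Running `create_service()` on a manager node and leaving it for several days should produce a steady churn of exited and replaced containers, which is the pattern described in this report.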

Expected behaviour
I would expect memory usage not to increase steadily until the host becomes unusable and requires a reboot.

Configuration:

  • Edition: Windows Server 2022 Standard Edition
  • Base Image being used: mcr.microsoft.com/dotnet/framework/runtime:4.8-windowsservercore-ltsc2022
  • Container engine: Docker
  • Container engine version: 20.10.9

Additional context
This issue did not seem to occur when the hosts ran Windows Server 2019.

andyfisher100 added the bug label Mar 17, 2023
microsoft-github-policy-service bot added the triage label Mar 17, 2023
fady-azmy-msft removed the triage label Mar 20, 2023

@microsoft-github-policy-service
Contributor

This issue has been open for 30 days with no updates.
@MikeZappa87, please provide an update or close this issue.

@fady-azmy-msft
Contributor

Looking into this. I created an internal ticket (#44339481) for tracking.

@MikeZappa87

@fady-azmy-msft I believe this is already being tracked.

@andyfisher100
Author

andyfisher100 commented Apr 27, 2023

Yes, I believe this is being tracked internally, per communication I've had with my company's Microsoft representative. I used poolmon to track down what was using the non-paged pool memory, and it was the HTab tag. This then appeared to be a known issue with Windows Server 2022.

I'd like to leave this open until the Windows Update that fixes the issue is published, as this may help others if they experience the same problem.

@microsoft-github-policy-service
Contributor

This issue has been open for 30 days with no updates.
@MikeZappa87, please provide an update or close this issue.

@fady-azmy-msft
Contributor

This bug should have been fixed with 9B (September's Patch Tuesday). @andyfisher100 can you confirm this no longer repros for you?

@andyfisher100
Author

andyfisher100 commented Oct 11, 2023

@fady-azmy-msft just to confirm, is the patch KB5030216?

I have installed the patch on multiple host machines that are part of our Docker Swarm. On one of the machines the non-paged pool seems to have stabilised at around 10.9 GB, although this still seems a little high. On another machine, however, the non-paged pool is currently up to 20.3 GB after 6 days of uptime and seems to be increasing by about 3 GB daily. According to poolmon, the HTab tag still accounts for the majority of the non-paged pool.
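
As a rough sanity check on those numbers (a sketch, not part of the original comment), the time left before a host fills up can be estimated with simple linear extrapolation. It assumes growth stays at about 3 GB per day and that the non-paged pool alone would have to reach the 98% mark mentioned in the issue description, so it is an upper bound; the function name is hypothetical.

```python
def days_until_limit(current_gb, growth_gb_per_day, limit_gb):
    """Days of linear growth left before the pool reaches limit_gb."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return max(0.0, (limit_gb - current_gb) / growth_gb_per_day)


# Figures from the comment above: 20.3 GB now, growing about 3 GB per day.
# Host sizes (64 GB and 128 GB) and the 98% mark come from the issue description.
for ram_gb in (64, 128):
    limit_gb = 0.98 * ram_gb  # assumption: the pool alone would fill the host
    days = days_until_limit(20.3, 3.0, limit_gb)
    print(f"{ram_gb} GB host: about {days:.0f} more days")
```

Under these assumptions the estimate is roughly 14 more days for a 64 GB host and 35 for a 128 GB host, which is in the same range as the reboot interval reported before the patch.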

We also have a support case open with yourselves and have been taking multiple memory dumps for analysis. The support engineer has mentioned that handles are being left open by containers. In our Docker setup, the containers managed by Swarm are short-lived: each one runs a single Azure DevOps pipeline job and then exits, and Swarm starts a new container for the next job. This means that a host may have many stopped containers from a Swarm service.
I have also seen files in the Docker\windowsfilter folder locked by a process when no containers are running but some had been running previously.

The Microsoft support engineer suggested that we need to open a case with our docker provider. Currently we make use of Docker CE/Moby as instructed in the MS documentation here https://learn.microsoft.com/en-us/virtualization/windowscontainers/quick-start/set-up-environment?tabs=dockerce#windows-server-1

Would you suggest raising an issue on the Moby page or is this issue sufficient?

@andyfisher100
Author

I have been working with MS support and it appears that this issue is now resolved via Windows updates. I can't see anything specific in the Windows Server 2022 patch notes, so I can't say for sure whether it's the patch @fady-azmy-msft mentioned or a later one. The system is patched with KB5032198 (November cumulative update).

Also worth noting that I updated Docker to the very latest version during my patching/testing too.
