Regression in CPU utilization virtualization #538
OK, this seems to be connected to the HWE kernel that ships with Jammy, or to some change in Jammy in general, although please see the questions below. I did a pretty hardcore nuke of the 5.0 install.
I can't find a way to check which version of lxcfs is actually running, though, so I can't be sure lxcfs was really downgraded as well. How do I check this? Also, despite the hardcore nuke above, I still saw some old data hanging around.

@stgraber Got any input on all this? Thanks :) I would like my testing to be valid and to actually revert to an older version of lxcfs, in order to determine for sure whether this is an lxcfs regression or whether something has changed in newer kernels/Jammy that causes this. If I don't get any response here, I guess my next test is to start over and install Focal instead of Jammy to see if things work with LXD 5.0, which would help determine whether this is some change in Jammy that lxcfs is not handling, or a regression as I initially thought.
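For what it's worth, a minimal sketch of how the running lxcfs version can be checked, assuming a Debian/Ubuntu-style setup (the snap path below is an assumption, since the LXD snap bundles its own lxcfs):

```sh
# Find the running lxcfs process and the binary it was started from
pgrep -a lxcfs

# Ask that binary for its version (path depends on how lxcfs was installed)
/usr/bin/lxcfs --version                 # distro package
/snap/lxd/current/bin/lxcfs --version    # LXD snap bundle (assumed path)

# Confirm the FUSE mount that containers actually see
grep lxcfs /proc/self/mounts
```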
All right, some more testing later and the results are in: on Ubuntu Focal 20.04 with LXD 5.0, both with the stock kernel and after installing the HWE kernel (5.13.0-40-generic), the issue is not present. So this is definitely some change in Jammy and/or in kernels newer than 5.13.0-40.

I have a bunch of systems deployed in a datacenter which I am unable to go live with due to this bug, as customers will surely complain, so my only choice here is to do a full reinstall over KVM to Focal, which is exceedingly slow and painful, unless I can get some hints on how to track this down and/or fix it :) I'll give it a day or two to see if anybody here wakes up and gives me some pointers before diving into that particular madness. Thanks!
Have you tried booting with unified_cgroup_hierarchy=0 as a kernel boot argument to see if it's cgroupv2 related?
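A quick way to confirm which hierarchy the host is actually on, before and after changing boot arguments:

```sh
# Prints "cgroup2fs" on a pure cgroupv2 (unified) hierarchy,
# "tmpfs" on a hybrid or legacy cgroupv1 setup
stat -fc %T /sys/fs/cgroup/

# Verify the kernel command line the system actually booted with
cat /proc/cmdline
```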
@tomponline Thank you for chiming in and for providing this hint. I tried adding this to my kernel command line.
After performing this change and rebooting, the problem has gone away! Note for the curious: swapaccount is an LXD-related setting we need for our setup, and init_on_alloc is a ZFS-related optimization.

Follow-up question: although this solves my immediate problem, is it an issue for me moving forward to have done this? cgroupv2 seems like a good thing and is what you support going forward, I'm guessing... But I suppose I could always switch back to cgroupv2 once lxcfs is fixed, or is that naive of me?
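For reference, a sketch of what a change like this typically looks like in /etc/default/grub; the systemd.unified_cgroup_hierarchy spelling and the exact swapaccount/init_on_alloc values are assumptions, not a quote of the actual configuration:

```sh
# In /etc/default/grub (illustrative; keep any options already on this line)
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0 swapaccount=1 init_on_alloc=0"

# Regenerate the boot configuration and reboot for it to take effect:
#   sudo update-grub && sudo reboot
```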
I suspect this is a problem with LXCFS in pure cgroupv2 environments which needs fixing.
Great, thank you for the update; we have already downgraded all affected systems. I'll be back to test this once a fix has been implemented :)
I have the same issue on Debian 11 with lxcfs 5.0.1. Switching to cgroupv1 fixed it. Looking forward to proper cgroupv2 support.
@brauner interested in looking into this one?
Getting this fixed would be very nice.
Chiming in that this is still experienced in v5.0.3. |
That's not an LXCFS bug; the problem is with what LXCFS currently relies on here:

Line 1136 in cd2e3ac
I'll put this on my to-do list to look at how to properly enable this controller with LXD.

-- Upd. The cgroup-v1 cpuacct.usage_all file has no equivalent in cgroup-v2, so that's a kernel limitation. Nothing can be done here from the LXCFS side. cc @stgraber
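For context, the difference is visible directly in the cgroup filesystems; the paths below assume a default Ubuntu layout:

```sh
# cgroup v1: the cpuacct controller exposes per-CPU usage, which is what
# allows per-CPU utilization to be virtualized (root cgroup shown here)
cat /sys/fs/cgroup/cpuacct/cpuacct.usage_all      # per-CPU user/system time, v1 only
cat /sys/fs/cgroup/cpuacct/cpuacct.usage_percpu   # per-CPU total usage, v1 only

# cgroup v2: only aggregate numbers, no per-CPU breakdown
cat /sys/fs/cgroup/cpu.stat                       # usage_usec, user_usec, system_usec
```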
Ping, +2 years later. Now that cgroupv2 is an inevitable fact and will fully replace v1 (upcoming systemd releases will apparently even refuse to boot under cgroupv1), I feel this issue needs to be revisited and has become more pressing.

Would it be a matter of putting in a request with whoever maintains cgroupv2 to resurrect cpuacct.usage_all, and then you'd be able to implement this in lxcfs? We really depend on this functionality to provide accurate CPU utilization metrics to container customers and are motivated to get this moving in the right direction. If you could point us to where we can raise this issue, or even provide financial motivation to get this metric implemented in cgroupv2, that would be much appreciated.
LXD v5.0.0
Ubuntu Jammy 5.15.0-27-generic
Create an Ubuntu container and stress all CPUs on the host system, e.g. stress -c 72. Enter the container and run htop, and you will see the system reports CPU utilization at 100% across all CPUs. Load average reporting is OK at ~0. A minimal reproduction sketch follows, with an illustrative container name and image alias.
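```sh
# On the host: create a container and load every CPU
lxc launch ubuntu:22.04 c1
stress -c 72            # match the host's core/thread count

# Inside the container: observe per-CPU utilization as seen through lxcfs
lxc exec c1 -- htop     # expected ~0% per CPU, observed 100% on all CPUs
```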
This seems to be a regression, as on Focal (5.13.0-35-generic) systems running LXD 4.19, CPU utilization is correctly reported as ~0% across all threads inside the container.