
High CPU load on Debian 12 VM caused by LKRG #301

Open
gnd opened this issue Jan 26, 2024 · 7 comments
Labels
shortcoming Something is worse than desired

Comments

gnd commented Jan 26, 2024

Hello,

we recently upgraded some of our VMs to Debian 12. They are used to run php8.2 for some web apps. However, as soon as we recompiled LKRG against the new kernel and started it, we noticed the CPU load reaching 100% very fast. This leads to machine lockups, and the apps become slow and non-responsive.

We tried many things, but it seems like LKRG is the issue. Once it is started, the load reaches 100% very fast; once it is turned off, the load falls back to normal within a minute. We run LKRG on dozens of machines, but only the ones running Debian 12 AND PHP have this issue. Older Debian machines with PHP and LKRG have no problem, and neither do machines that do not run PHP workloads.

We tried fiddling with the module's parameters, e.g. lkrg.profile_validate, setting it via sysctl all the way down to 0, but this didn't help.
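For the record, the changes were applied roughly like this (the exact values varied; 0 is just the lowest we tried):

# sysctl -w lkrg.profile_validate=0
lkrg.profile_validate = 0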

We also tried building and running older LKRG releases - with the same result (specifically, commit 7db7483). In the current state we can't run LKRG, even though we would like to have it :(

Do you have any ideas what might be wrong, or how we could help you debug this issue? Thanks!

Attached is a screenshot from Grafana, showing the effect of enabling and disabling LKRG 3 times in a row.

[Grafana screenshot: lkrg_load]

@solardiz added the shortcoming (Something is worse than desired) label on Jan 26, 2024
solardiz (Contributor) commented:

Thank you for reporting this @gnd. My main two guesses as to what could be causing this are:

  1. Too frequent kernel integrity verification, which LKRG by default performs not only periodically, but also on "random events". However, if you did in fact try lowering lkrg.profile_validate all the way to 0 and that didn't help, this guess is ruled out. You may want to double-check, though, by setting lkrg.kint_validate to a lower value (it should be sufficient to lower it from 3 to 2, but you can also try 0) - see the example after this list.

  2. Too frequent updates of the kernel's code. The kernel uses self-modifying code for so-called "jump labels", and LKRG keeps track of that. In fact, currently LKRG does so even when lkrg.kint_validate is 0, so that you'd be able to switch from 0 to non-0 later. Maybe we need to add a mode where such tracking is also disabled, or just disable it at 0 and either don't allow switching to non-0 without LKRG reload or perform hash recalculation when switching from 0 to non-0. Maybe we also need to add a way to update hashes to reflect a "jump label" change quickly, without full recalculation, although for that we'd have to use weaker hashing or a large number of hashes (e.g., one hash per 4 KiB).
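For reference, testing guess 1 should just be a matter of (the values below are only an example):

# sysctl -w lkrg.kint_validate=2
lkrg.kint_validate = 2
# sysctl -w lkrg.kint_validate=0
lkrg.kint_validate = 0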

Per your analysis so far, this is more likely issue 2 above.

It's puzzling that PHP causes this. It's also puzzling that a "jump label" would presumably be switching back and forth - normally, these are only switched once or very infrequently (on changes to kernel runtime configuration via sysctl or such). This could indicate a minor kernel bug, where what was meant to be an optimization ended up the other way around, since even without LKRG updating the kernel code has some performance cost.
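Also, if it's convenient to install perf on one of the affected VMs, a system-wide profile taken during a spike would show where the kernel CPU time actually goes (e.g., whether it's spent recalculating hashes). Something along these lines (a 10-second sample, just as an illustration):

# perf record -a -g -- sleep 10
# perf report --stdio --sort comm,dso,symbol | head -n 40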

Adam-pi3 (Collaborator) commented:

Is it possible to see the list of all processes while you have such a spike of CPU usage? If the problem is related to JUMP_LABEL, we should see spikes related to kernel worker threads.
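For example (just a suggestion), a snapshot captured during a spike along these lines would do:

# top -b -n 1 -o %CPU | head -n 30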

gnd (Author) commented Jan 29, 2024

Hello. Unfortunately, if you mean the number of kworker processes, their number remained about the same. Here is a log:

# ps -ef|grep kworker|grep -v grep|wc -l
37
# systemctl start lkrg; sleep 240; ps -ef|grep kworker|grep -v grep|wc -l
39
# w
 10:28:33 up 5 days, 14:28,  3 users,  load average: 143.80, 74.00, 30.48
# systemctl stop lkrg

solardiz (Contributor) commented:

I think Adam meant not the number of those processes, but whether they're the ones actively running on CPU (e.g. per top) during the load spikes. Anyway, you show that the number of kworker processes is way lower than the load average, suggesting that there are many other processes in running state. It would be helpful to see the output of ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu during one of those load spikes.

gnd (Author) commented Jan 30, 2024

Hello, attached are two files: one from before enabling LKRG, and one from after LKRG was enabled, when the load reached > 100.

ps_before.txt
ps_after.txt

solardiz (Contributor) commented:

Thanks @gnd. This is puzzling. We really need the WCHAN field to hopefully figure it out. I don't know why exactly it is empty for you, but perhaps you need to run ps with greater privileges?
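For example, running it as root during one of the spikes (in case privileges are what's hiding the wchan info):

# ps axo pid,pcpu,stat,time,wchan:30,comm k -pcpu | head -n 40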

gnd (Author) commented Jan 30, 2024

This might be because of some custom sysctl settings... let me check.
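Meanwhile, a quick sanity check - my understanding is that ps takes WCHAN from /proc/<pid>/wchan, so something like this should show whether the kernel exposes it at all:

# cat /proc/1/wchan; echo
# ps o pid,wchan:30,comm -p 1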
