High CPU usage when running Katran in shared mode with bonding interface #235
Can you run sudo sysctl -a | grep bpf?
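For reference, whether eBPF programs get JIT-compiled is controlled by a sysctl on most kernels. A minimal check, assuming a Linux host with root access:

```shell
# net.core.bpf_jit_enable: 0 = interpreter only, 1 = JIT on,
# 2 = JIT on with debug output (exact semantics are kernel-version dependent).
cat /proc/sys/net/core/bpf_jit_enable

# Turn the JIT on if it is off:
sudo sysctl -w net.core.bpf_jit_enable=1

# bpf_jit_harden trades JIT performance for hardening; a non-zero
# value here can also cost cycles:
sysctl net.core.bpf_jit_harden
```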
Hi @tehnerd,
Hmm. Strange.
Also, what does the traffic pattern look like? Are they real TCP streams or just random packets?
UPDATE:
It feels like the bpf program is not JITed. Could you please run bpftool prog list and bpftool map list?
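bpftool prints whether each loaded program has been JIT-compiled. A sketch, assuming root access; the entry point in katran's balancer_kern.c is balancer_ingress, but note that the kernel truncates program names to 15 characters in this listing, so grep for a prefix:

```shell
# Each program's listing includes "jited <size>B" when the JIT compiled it.
sudo bpftool prog list

# Narrow to katran's XDP program (name is truncated, so match a prefix):
sudo bpftool prog list | grep -A2 balancer_ing
```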
Also, in the perf report (which was taken with -ag)
Hi @tehnerd, would you happen to have any updates on this issue?
No idea. For some reason the bpf program seems slow in the bpf code itself. At this point the only idea is to build perf with bpf support (link against the library required to disassemble bpf) and to check where those CPU cycles are spent inside bpf.
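As an alternative to rebuilding perf, bpftool itself can dump the program in both forms; a sketch, where the program id 104 is hypothetical and should be taken from bpftool prog list:

```shell
# Dump the JIT-compiled native machine code of program id 104
# (requires bpftool built with libbfd disassembler support):
sudo bpftool prog dump jited id 104

# Or dump the translated (pre-JIT) bpf instructions, which needs
# no disassembler at all:
sudo bpftool prog dump xlated id 104
```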
You mention that it feels like the bpf program is not jitted. |
Hi @tehnerd, it has been a while, and I finally got the assembly code output of the Katran bpf loadbalancer program. Command: I attached the content. Could you please take a look at it?
All of that is memory accesses. It feels like there is some issue with them: either slow memory, or the system is low on memory, or the TLB is thrashed. What does the environment look like? Is it a VM or not? What is the memory? How much? How much free? It would be nice to see perf counters for memory accesses (stalled frontend/backend cycles, TLB stats).
Here is the memory information from my server.
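To answer the questions above, a few standard Linux commands summarize memory, hugepage, and NUMA state (nothing katran-specific assumed):

```shell
# Total / free / available memory:
free -h

# Hugepage state; transparent hugepages can affect TLB pressure:
grep -i huge /proc/meminfo
cat /sys/kernel/mm/transparent_hugepage/enabled

# NUMA layout (numactl must be installed); cross-node memory
# accesses are a common source of memory stalls:
numactl --hardware
```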
Hi @tehnerd, did you find any clues from the memory stats?
Can you run perf to collect counters for "cycles,stalled-cycles-frontend,stalled-cycles-backend"? The only explanation that makes sense for a mov to /
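A sketch of the requested collection, plus TLB counters per the earlier comment about TLB thrashing. Availability of the stalled-cycles-* events depends on the CPU and PMU, which is exactly the problem the reporter hits below on Cascade Lake:

```shell
# System-wide counters over 30 seconds:
sudo perf stat -a \
  -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend \
  -- sleep 30

# TLB pressure:
sudo perf stat -a -e dTLB-loads,dTLB-load-misses,iTLB-load-misses -- sleep 30

# Stall ratio from hypothetical counts: stalled-backend / cycles
# (here 7.5e9 stalled over 1.0e10 cycles):
awk 'BEGIN { printf "%.0f%%\n", 100 * 7.5e9 / 1.0e10 }'   # prints 75%
```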
It's quite a challenge to collect those counters because my CPU is of the Cascade Lake microarchitecture, and it seems that
I looked into the kernel event code:
I found the IPC metric, which is useful for identifying CPU stalls (this blog is very helpful: https://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html). The value is between 0.18 and 0.23, which is quite low, and the interpretation from the blog is that such a low IPC value means the CPU is stalled, most likely waiting on memory I/O:
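IPC comes straight out of perf stat; a minimal sketch, where the 0.18 figure just reproduces the reporter's low end as arithmetic:

```shell
# perf prints "insn per cycle" automatically with these two events:
sudo perf stat -a -e cycles,instructions -- sleep 10

# IPC = instructions / cycles; e.g. 1.8e9 instructions over 1.0e10 cycles:
awk 'BEGIN { printf "%.2f\n", 1.8e9 / 1.0e10 }'   # prints 0.18
```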
I suppose that focusing on memory tuning could increase performance, but I have no clues on that yet.
Yeah, stalls are way too high. And they are not even on map accesses. Anyway, I think there is some issue with the hardware. Do you have any other test machine to run on?
That seems pretty strange. I do not think it is related to the NIC. Seems like some strange memory-related hardware issue or specifics. I will post later today how to run it, but I wonder what the results of synthetic load tests would look like. They would test just the bpf code itself. From
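For context, katran ships an in-repo tester that drives the bpf program through the kernel's BPF_PROG_TEST_RUN facility, so it exercises only the bpf code and not the NIC/bonding path. The flag names and paths below are hypothetical reconstructions; check --help on your build for the exact spelling:

```shell
# Hypothetical invocation: -perf_testing runs the program repeatedly
# in-kernel against canned packets and reports per-packet timing.
sudo ./katran_tester -balancer_prog ./balancer_kern.o -perf_testing
```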
Hi everyone!
I am currently running Katran as a L3 Director load balancer for our services.
I would like to run Katran with a bonding interface because I believe it's easier to add more network interfaces rather than servers for scaling Katran's workload.
I followed this issue (#13) and got Katran working normally in shared mode with a bonding interface using these commands:
And all 20 CPUs are consumed by ksoftirqd. Here is a screenshot showing the output of perf report:
I am not sure whether this performance issue is related to Katran or not, so I am posting this question here to find some clues.
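When ksoftirqd saturates every core, /proc/softirqs shows which softirq class is responsible; a quick check on any Linux box, nothing katran-specific assumed:

```shell
# NET_RX counters growing rapidly on all CPUs is consistent with the
# symptom of every core sitting in ksoftirqd:
grep -E 'CPU|NET_RX' /proc/softirqs

# Per-CPU softirq time over time (mpstat is in the sysstat package):
mpstat -P ALL 1 5
```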
Feel free to ask me to provide more information!