-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pprof-rs seg faults #10225
Comments
It's possible that this might work better if we enable frame pointers rather than using DWARF for stack unwinding: grafana/pyroscope-rs#124. This would also improve performance of stack unwinding, see #10224. It's possible that aarch64 builds already enable the frame pointer by default, as aarch64 typically mandates use of a dedicated frame pointer register. Pageservers run on aarch64 nodes. It also seems like the pprof-rs |
It could also be that the jemalloc and pprof profilers are somehow interfering with each other. jemalloc profiles are taken every 1 MB allocated. |
Materialize hit this issue as well: tikv/pprof-rs#36 |
Scratch that, we've seen frequent segfaults in staging since we enabled continuous profiling. As an initial workaround, we can disable heap profile sampling when taking a CPU profile. However, that will prevent us from using continuous profiling and heap profiling, since we'll be taking CPU profiles all the time. For now, let's just disable heap profiles entirely. |
## Problem Since enabling continuous profiling in staging, we've seen frequent seg faults. This is suspected to be because jemalloc and pprof-rs take a stack trace at the same time, and the handlers aren't signal safe. jemalloc does this probabilistically on every allocation, regardless of whether someone is taking a heap profile, which means that any CPU profile has a chance to cause a seg fault. Touches #10225. ## Summary of changes For now, just disable heap profiles -- CPU profiles are more important, and we need to be able to take them without risking a crash.
## Problem Since enabling continuous profiling in staging, we've seen frequent seg faults. This is suspected to be because jemalloc and pprof-rs take a stack trace at the same time, and the handlers aren't signal safe. jemalloc does this probabilistically on every allocation, regardless of whether someone is taking a heap profile, which means that any CPU profile has a chance to cause a seg fault. Touches #10225. ## Summary of changes For now, just disable heap profiles -- CPU profiles are more important, and we need to be able to take them without risking a crash.
Still seeing seg faults in staging after disabling jemalloc heap profiling. Must be an issue with pprof-rs itself. |
## Problem Frame pointers are typically disabled by default (depending on CPU architecture), to improve performance. This frees up a CPU register, and avoids a couple of instructions per function call. However, it makes stack unwinding much more inefficient, since it has to use DWARF debug information instead, and gives worse results with e.g. `perf` and eBPF profiles. The `backtrace` implementation of `libunwind` is also suspected to cause seg faults. The performance benefit of frame pointer omission doesn't appear to matter that much on modern 64-bit CPU architectures (which have plenty of registers and optimized instruction execution), and benchmarks did not show measurable overhead. The Rust standard library and jemalloc already enable frame pointers by default. For more information, see https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html. Resolves #10224. Touches #10225. ## Summary of changes Enable frame pointers in all builds, and use frame pointers for pprof-rs stack sampling.
Using frame pointers stopped the seg faults: #10226. Keeping this open until we conclude whether to keep them. |
Resolved by #10226. |
We saw a Pageserver segfault while taking a CPU profile via
pprof-rs
.We haven't seen any such segfaults in staging, despite running continuous profiling for a few weeks.We've also seen multiple seg faults daily in staging since enabling continuous profiling.
The text was updated successfully, but these errors were encountered: