pprof-rs seg faults #10225

erikgrinaker · 2024-12-22T19:27:27Z

We saw a Pageserver segfault while taking a CPU profile via pprof-rs.

~~We haven't seen any such segfaults in staging, despite running continuous profiling for a few weeks.~~

We've seen multiple seg faults daily in staging since enabling continuous profiling.

erikgrinaker · 2024-12-22T19:33:33Z

This might work better if we enable frame pointers instead of using DWARF for stack unwinding: grafana/pyroscope-rs#124. It would also improve stack unwinding performance, see #10224.

It's possible that aarch64 builds already enable the frame pointer by default, as aarch64 typically mandates use of a dedicated frame pointer register. Pageservers run on aarch64 nodes.

It also seems like the pprof-rs frame-pointer feature is nightly-only, due to issues with the stdlib not handling frame pointers correctly, but this may have been resolved when the stdlib recently enabled frame pointers by default.

erikgrinaker · 2024-12-23T10:35:10Z

It could also be that the jemalloc and pprof profilers are somehow interfering with each other. jemalloc profiles are taken every 1 MB allocated.

erikgrinaker · 2024-12-30T09:23:23Z

> It could also be that the jemalloc and pprof profilers are somehow interfering with each other.

Materialize hit this issue as well: tikv/pprof-rs#36

erikgrinaker · 2025-01-03T13:57:19Z

> We haven't seen any such segfaults in staging, despite running continuous profiling for a few weeks.

Scratch that, we've seen frequent segfaults in staging since we enabled continuous profiling.

As an initial workaround, we can disable heap profile sampling while a CPU profile is being taken. However, that will prevent us from using continuous profiling and heap profiling together, since we'll be taking CPU profiles all the time.

For now, let's just disable heap profiles entirely.
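The initial workaround mentioned above (pausing heap sampling for the duration of a CPU profile) could look roughly like the following. This is only a sketch under stated assumptions: `HEAP_SAMPLING_ACTIVE` is a hypothetical stand-in for the allocator's real profiling switch, and `with_heap_sampling_paused` is an illustrative helper name, not something from the Pageserver code base or from pprof-rs.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the allocator's "heap sampling active" switch.
/// A real build would toggle the allocator's own profiling control instead.
static HEAP_SAMPLING_ACTIVE: AtomicBool = AtomicBool::new(true);

/// RAII guard: pauses heap sampling on creation and restores the
/// previously observed state when dropped.
struct HeapSamplingPaused {
    was_active: bool,
}

impl HeapSamplingPaused {
    fn new() -> Self {
        // `swap` returns the old value, so we remember what to restore.
        let was_active = HEAP_SAMPLING_ACTIVE.swap(false, Ordering::SeqCst);
        Self { was_active }
    }
}

impl Drop for HeapSamplingPaused {
    fn drop(&mut self) {
        HEAP_SAMPLING_ACTIVE.store(self.was_active, Ordering::SeqCst);
    }
}

/// Returns whether heap sampling is currently enabled.
fn heap_sampling_active() -> bool {
    HEAP_SAMPLING_ACTIVE.load(Ordering::SeqCst)
}

/// Runs `take_cpu_profile` with heap sampling paused for its duration.
/// The guard also restores the flag if the closure panics.
fn with_heap_sampling_paused<T>(take_cpu_profile: impl FnOnce() -> T) -> T {
    let _guard = HeapSamplingPaused::new();
    take_cpu_profile()
}
```

Wrapping the CPU profile in `with_heap_sampling_paused(|| ...)` keeps the flag off only while the closure runs. With continuous CPU profiling that window is effectively "always", which is why this approach amounts to turning heap profiles off. The guard handles nested use, but overlapping profiles on different threads would need a counter rather than a boolean.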

## Problem

Since enabling continuous profiling in staging, we've seen frequent seg faults. This is suspected to be because jemalloc and pprof-rs take a stack trace at the same time, and the handlers aren't signal safe. jemalloc does this probabilistically on every allocation, regardless of whether someone is taking a heap profile, which means that any CPU profile has a chance to cause a seg fault.

Touches #10225.

## Summary of changes

For now, just disable heap profiles -- CPU profiles are more important, and we need to be able to take them without risking a crash.

erikgrinaker · 2025-01-04T13:48:27Z

Still seeing seg faults in staging after disabling jemalloc heap profiling. Must be an issue with pprof-rs itself.

## Problem

Frame pointers are typically disabled by default (depending on CPU architecture), to improve performance. This frees up a CPU register, and avoids a couple of instructions per function call. However, it makes stack unwinding much less efficient, since it has to use DWARF debug information instead, and gives worse results with e.g. `perf` and eBPF profiles. The `backtrace` implementation of `libunwind` is also suspected to cause seg faults.

The performance benefit of frame pointer omission doesn't appear to matter that much on modern 64-bit CPU architectures (which have plenty of registers and optimized instruction execution), and benchmarks did not show measurable overhead.

The Rust standard library and jemalloc already enable frame pointers by default.

For more information, see https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.

Resolves #10224.
Touches #10225.

## Summary of changes

Enable frame pointers in all builds, and use frame pointers for pprof-rs stack sampling.
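To illustrate why frame-pointer unwinding is so much simpler than DWARF-based unwinding, here is a toy model of the walk in Rust. It is not pprof-rs's implementation: the stack is simulated as a slice of words, and the layout (saved caller frame pointer at `stack[fp]`, return address at `stack[fp + 1]`, a saved value of 0 marking the outermost frame) is an assumption modeled on the usual x86-64 and aarch64 frame record. The function name `walk_frame_pointers` is hypothetical.

```rust
/// Walks a chain of frame records in a simulated stack and collects the
/// return addresses, innermost frame first.
///
/// Assumed layout: `stack[fp]` holds the caller's frame pointer and
/// `stack[fp + 1]` holds the return address. A saved frame pointer of 0
/// marks the outermost frame.
fn walk_frame_pointers(stack: &[usize], mut fp: usize, max_depth: usize) -> Vec<usize> {
    let mut return_addresses = Vec::new();
    while fp != 0 && return_addresses.len() < max_depth {
        // Bounds checks: an unwinder must validate a frame pointer before
        // reading through it, since it may be garbage.
        let Some(&caller_fp) = stack.get(fp) else { break };
        let Some(&ret) = fp.checked_add(1).and_then(|i| stack.get(i)) else {
            break;
        };
        return_addresses.push(ret);
        // Stacks grow downwards, so the caller's record must sit at a
        // higher index. Anything else means the chain is corrupt.
        if caller_fp != 0 && caller_fp <= fp {
            break;
        }
        fp = caller_fp;
    }
    return_addresses
}
```

Each step is two loads and a comparison, with no lookup in debug information, which is the efficiency argument made above. The bounds and ordering checks stand in for the validation a sampler has to do when it interrupts a thread at an arbitrary point.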

erikgrinaker · 2025-01-07T20:25:56Z

Using frame pointers stopped the seg faults: #10226. Keeping this open until we conclude whether to keep them.

This reverts commit b33299d. Heap profiles weren't the culprit after all. Touches #10225.

erikgrinaker · 2025-01-08T15:05:08Z

Resolved by #10226.

erikgrinaker added the a/observability (Area: related to observability), c/storage (Component: storage), and t/bug (Issue Type: Bug) labels Dec 22, 2024

erikgrinaker self-assigned this Dec 22, 2024

erikgrinaker mentioned this issue Dec 29, 2024

cargo: build with frame pointers #10226

Merged

erikgrinaker mentioned this issue Jan 3, 2025

pageserver,safekeeper: disable heap profiling #10268

Merged

erikgrinaker mentioned this issue Jan 7, 2025

Revert "pageserver,safekeeper: disable heap profiling (#10268)" #10303

Merged

github-merge-queue bot pushed a commit that referenced this issue Jan 7, 2025

Revert "pageserver,safekeeper: disable heap profiling (#10268)" (#10303)

237dae7

This reverts commit b33299d. Heap profiles weren't the culprit after all. Touches #10225.

erikgrinaker closed this as completed Jan 8, 2025
