Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pprof-rs seg faults #10225

Closed
erikgrinaker opened this issue Dec 22, 2024 · 7 comments
Closed

pprof-rs seg faults #10225

erikgrinaker opened this issue Dec 22, 2024 · 7 comments
Assignees
Labels
a/observability Area: related to observability c/storage Component: storage t/bug Issue Type: Bug

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Dec 22, 2024

We saw a Pageserver segfault while taking a CPU profile via pprof-rs.

We haven't seen any such segfaults in staging, despite running continuous profiling for a few weeks.

We've also seen multiple seg faults daily in staging since enabling continuous profiling.

@erikgrinaker erikgrinaker added a/observability Area: related to observability c/storage Component: storage t/bug Issue Type: Bug labels Dec 22, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 22, 2024
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Dec 22, 2024

It's possible that this might work better if we enable frame pointers rather than using DWARF for stack unwinding: grafana/pyroscope-rs#124. This would also improve performance of stack unwinding, see #10224.

It's possible that aarch64 builds already enable the frame pointer by default, as aarch64 typically mandates use of a dedicated frame pointer register. Pageservers run on aarch64 nodes.

It also seems like the pprof-rs frame-pointer feature is nightly-only, due to issues with the stdlib not handling frame pointers correctly, but this may have been resolved when the stdlib recently enabled frame pointers by default.

@erikgrinaker
Copy link
Contributor Author

It could also be that the jemalloc and pprof profilers are somehow interfering with each other. jemalloc profiles are taken every 1 MB allocated.

@erikgrinaker
Copy link
Contributor Author

It could also be that the jemalloc and pprof profilers are somehow interfering with each other.

Materialize hit this issue as well: tikv/pprof-rs#36

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jan 3, 2025

We haven't seen any such segfaults in staging, despite running continuous profiling for a few weeks.

Scratch that, we've seen frequent segfaults in staging since we enabled continuous profiling.

As an initial workaround, we can disable heap profile sampling when taking a CPU profile. However, that will prevent us from using continuous profiling and heap profiling, since we'll be taking CPU profiles all the time.

For now, let's just disable heap profiles entirely.

github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
## Problem

Since enabling continuous profiling in staging, we've seen frequent seg
faults. This is suspected to be because jemalloc and pprof-rs take a
stack trace at the same time, and the handlers aren't signal safe.
jemalloc does this probabilistically on every allocation, regardless of
whether someone is taking a heap profile, which means that any CPU
profile has a chance to cause a seg fault.

Touches #10225.

## Summary of changes

For now, just disable heap profiles -- CPU profiles are more important,
and we need to be able to take them without risking a crash.
erikgrinaker added a commit that referenced this issue Jan 3, 2025
## Problem

Since enabling continuous profiling in staging, we've seen frequent seg
faults. This is suspected to be because jemalloc and pprof-rs take a
stack trace at the same time, and the handlers aren't signal safe.
jemalloc does this probabilistically on every allocation, regardless of
whether someone is taking a heap profile, which means that any CPU
profile has a chance to cause a seg fault.

Touches #10225.

## Summary of changes

For now, just disable heap profiles -- CPU profiles are more important,
and we need to be able to take them without risking a crash.
@erikgrinaker
Copy link
Contributor Author

Still seeing seg faults in staging after disabling jemalloc heap profiling. Must be an issue with pprof-rs itself.

github-merge-queue bot pushed a commit that referenced this issue Jan 6, 2025
## Problem

Frame pointers are typically disabled by default (depending on CPU
architecture), to improve performance. This frees up a CPU register, and
avoids a couple of instructions per function call. However, it makes
stack unwinding much more inefficient, since it has to use DWARF debug
information instead, and gives worse results with e.g. `perf` and eBPF
profiles. The `backtrace` implementation of `libunwind` is also
suspected to cause seg faults.

The performance benefit of frame pointer omission doesn't appear to
matter that much on modern 64-bit CPU architectures (which have plenty
of registers and optimized instruction execution), and benchmarks did
not show measurable overhead.

The Rust standard library and jemalloc already enable frame pointers by
default.

For more information, see
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.

Resolves #10224.
Touches #10225.

## Summary of changes

Enable frame pointers in all builds, and use frame pointers for pprof-rs
stack sampling.
@erikgrinaker
Copy link
Contributor Author

Using frame pointers stopped the seg faults: #10226. Keeping this open until we conclude whether to keep them.

github-merge-queue bot pushed a commit that referenced this issue Jan 7, 2025
This reverts commit b33299d.

Heap profiles weren't the culprit after all.

Touches #10225.
@erikgrinaker
Copy link
Contributor Author

Resolved by #10226.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/observability Area: related to observability c/storage Component: storage t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

1 participant