Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cargo: build with frame pointers #10226

Merged
merged 2 commits into from
Jan 6, 2025
Merged

cargo: build with frame pointers #10226

merged 2 commits into from
Jan 6, 2025

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Dec 22, 2024

Problem

Frame pointers are typically disabled by default (depending on CPU architecture), to improve performance. This frees up a CPU register, and avoids a couple of instructions per function call. However, it makes stack unwinding much more inefficient, since it has to use DWARF debug information instead, and gives worse results with e.g. perf and eBPF profiles. The backtrace implementation of libunwind is also suspected to cause seg faults.

The performance benefit of frame pointer omission doesn't appear to matter that much on modern 64-bit CPU architectures (which have plenty of registers and optimized instruction execution), and benchmarks did not show measurable overhead.

The Rust standard library and jemalloc already enable frame pointers by default.

For more information, see https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.

Resolves #10224.
Touches #10225.

Summary of changes

Enable frame pointers in all builds, and use frame pointers for pprof-rs stack sampling.

@erikgrinaker erikgrinaker changed the title cargo: build with fram pointers cargo: build with frame pointers Dec 22, 2024
Copy link

github-actions bot commented Dec 22, 2024

7227 tests run: 6875 passed, 0 failed, 352 skipped (full report)


Flaky tests (5)

Postgres 17

Postgres 16

Code coverage* (full report)

  • functions: 31.2% (8412 of 26948 functions)
  • lines: 48.0% (66780 of 139204 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
e180c0b at 2025-01-04T15:21:22.545Z :recycle:

@erikgrinaker
Copy link
Contributor Author

I ran some benchmarks for WalStreamDecoder::complete_record() on a Linux amd64 box. This isn't necessarily representative (it's mostly very optimized CRC32 checksumming), but it is a CPU-bound hot path that doesn't involve allocations or IO. The results don't show significant overhead -- there is a small amount of variation though, so I ran each benchmark for 30 seconds multiple times, with similar results.

complete_record/size=64 time:   [41.561 ns 41.891 ns 42.152 ns]
                        thrpt:  [1.4141 GiB/s 1.4229 GiB/s 1.4341 GiB/s]
                 change:
                        time:   [+0.7893% +1.1356% +1.4278%] (p = 0.00 < 0.05)
                        thrpt:  [-1.4077% -1.1228% -0.7831%]

complete_record/size=1024
                        time:   [144.64 ns 144.73 ns 144.83 ns]
                        thrpt:  [6.5847 GiB/s 6.5893 GiB/s 6.5935 GiB/s]
                 change:
                        time:   [-3.4100% -3.3531% -3.2898%] (p = 0.00 < 0.05)
                        thrpt:  [+3.4017% +3.4694% +3.5303%]

complete_record/size=8192
                        time:   [965.16 ns 965.27 ns 965.39 ns]
                        thrpt:  [7.9029 GiB/s 7.9039 GiB/s 7.9048 GiB/s]
                 change:
                        time:   [+0.4465% +0.4593% +0.4730%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4708% -0.4572% -0.4445%]

complete_record/size=131072
                        time:   [13.298 µs 13.352 µs 13.425 µs]
                        thrpt:  [9.0927 GiB/s 9.1425 GiB/s 9.1799 GiB/s]
                 change:
                        time:   [+5.9949% +6.8990% +7.8233%] (p = 0.00 < 0.05)
                        thrpt:  [-7.2557% -6.4537% -5.6558%]

@erikgrinaker
Copy link
Contributor Author

Interestingly, the frame-pointer feature of pprof-rs is an order of magnitude slower than just using libunwind without frame pointers (11 µs vs. 1.4 µs). libunwind performance did not change with or without frame pointers.

This was using a benchmark of pprof-rs TraceImpl::trace() with a stack depth of 40 on Linux.

This probably isn't worth it then. There was some hope that it might resolve #10225, as seen in grafana/pyroscope-rs#124, but if it requires the frame-pointer feature then it'll cause a 10x slowdown of traces which isn't acceptable.

@erikgrinaker
Copy link
Contributor Author

Given the frequency with which we're seeing seg faults in staging, even with heap profiling disabled, maybe this is worth a shot -- at the very least, to see if it resolves the seg faults.

@erikgrinaker erikgrinaker reopened this Jan 4, 2025
@erikgrinaker erikgrinaker marked this pull request as ready for review January 4, 2025 14:20
@erikgrinaker erikgrinaker requested review from a team and jcsp and removed request for a team January 4, 2025 14:20
@jcsp jcsp added this pull request to the merge queue Jan 6, 2025
Merged via the queue into main with commit 95f1920 Jan 6, 2025
82 checks passed
@jcsp jcsp deleted the erik/frame-pointer branch January 6, 2025 17:28
@skyzh
Copy link
Member

skyzh commented Jan 6, 2025

neon/Dockerfile

Lines 47 to 58 in 95f1920

RUN set -e \
&& PQ_LIB_DIR=$(pwd)/pg_install/v${STABLE_PG_VERSION}/lib RUSTFLAGS="-Clinker=clang -Clink-arg=-fuse-ld=mold -Clink-arg=-Wl,--no-rosegment ${ADDITIONAL_RUSTFLAGS}" cargo build \
--bin pg_sni_router \
--bin pageserver \
--bin pagectl \
--bin safekeeper \
--bin storage_broker \
--bin storage_controller \
--bin proxy \
--bin neon_local \
--bin storage_scrubber \
--locked --release

Do we also need to add it to the dockerfile or the flags will be combined?

@erikgrinaker
Copy link
Contributor Author

Do we also need to add it to the dockerfile or the flags will be combined?

Good catch, the environment variable will override them. I'll submit a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build with frame pointers for improved profiling
3 participants