Compute release 2025-01-07 #10288

vipvap · 2025-01-07T10:48:13Z

Compute release 2025-01-07

Please merge this Pull Request using the 'Create a merge commit' button.

Allow github-action-script to post reports. Failed CI: https://github.com/neondatabase/neon/actions/runs/12304655364/job/34342554049#step:13:514

## Problem

pg_regress tests start failing due to unique ids added to Neon error messages.

## Summary of changes

Patches updated.

## Problem

We want to extract the safekeeper HTTP client into a separate crate for use in the storage controller and neon_local. However, many types used in the API are internal to safekeeper.

## Summary of changes

Move them to the safekeeper_api crate. No functional changes.

ref #9011

## Problem

`test_prefetch` is flaky (#9961), but when it passes, the run time is less than 30 seconds, so we don't need an extended timeout for it.

## Summary of changes

- Remove extended test timeout for `test_prefetch`

…10083) The test was failing with the scary but generic message `Remote storage metadata corrupted`. The underlying scrubber error is `Orphan layer detected: ...`.

The test kills the pageserver at random points, hence it's expected that we leak layers if we're killed in the window after layer upload but before it's referenced from index part. Refer to the generation numbers RFC for details.

Refs:
- fixes #9988
- root-cause analysis #9988 (comment)

## Problem

The LFC `used_pages` statistic is not updated when the LFC is resized (shrinking `neon.file_cache_size_limit`).

## Summary of changes

Update `lfc_ctl->used_pages` in `lfc_change_limit_hook`.

Co-authored-by: Konstantin Knizhnik <[email protected]>
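The bookkeeping issue can be pictured with a small sketch. The class and method names below are hypothetical stand-ins, not the actual `lfc_ctl` code in the Neon extension; the only point is that shrinking the limit has to decrement the used-pages counter for every page it evicts.

```python
from collections import OrderedDict


class FileCacheSketch:
    """Toy LRU page cache; a hypothetical stand-in for the LFC control state."""

    def __init__(self, size_limit):
        self.size_limit = size_limit
        self.pages = OrderedDict()  # page number -> page contents, oldest first
        self.used_pages = 0         # statistic that must always equal len(self.pages)

    def _evict_one(self):
        # Drop the least recently used page and keep the statistic in sync.
        self.pages.popitem(last=False)
        self.used_pages -= 1

    def put(self, page_no, data):
        if self.size_limit <= 0:
            return  # cache disabled
        if page_no in self.pages:
            self.pages.move_to_end(page_no)
            self.pages[page_no] = data
            return
        if self.used_pages >= self.size_limit:
            self._evict_one()
        self.pages[page_no] = data
        self.used_pages += 1

    def change_limit(self, new_limit):
        # Shrinking evicts pages; used_pages is updated for each eviction.
        while self.used_pages > new_limit:
            self._evict_one()
        self.size_limit = new_limit
```

Routing every eviction through one helper means the resize path cannot forget the counter update, which is the class of omission described above.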

…9974) Improved comments will help others when they read the code, and the log messages will help others understand why the logical replication monitor works the way it does.

Signed-off-by: Tristan Partin <[email protected]>

## Problem

close #10124

gc-compaction split_gc_jobs holds the repartition lock for too long.

## Summary of changes

* Ensure split_gc_compaction_jobs drops the repartition lock once it finishes cloning the structures.
* Update comments.

---------

Signed-off-by: Alex Chi Z <[email protected]>
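The locking pattern described here (clone under the lock, then work on the copy) can be sketched as follows. `TimelineSketch`, `split_jobs` and the list-of-partitions layout are assumptions made for illustration; the real code is Rust and looks different.

```python
import copy
import threading


class TimelineSketch:
    """Hypothetical holder of partitioning state guarded by a lock."""

    def __init__(self, partitions):
        self._repartition_lock = threading.Lock()
        self._partitions = partitions

    def split_jobs(self, max_per_job):
        # Hold the lock only for as long as it takes to clone the shared state.
        with self._repartition_lock:
            snapshot = copy.deepcopy(self._partitions)
        # The lock is released here; the slower splitting runs on the private copy.
        return [
            snapshot[i:i + max_per_job]
            for i in range(0, len(snapshot), max_per_job)
        ]
```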

## Problem

The `benchmarking` job fails because the `aws-oicd-role-arn` input is not set.

## Summary of changes

- Set `aws-oicd-role-arn` for the `benchmarking` job
- Always require `aws-oicd-role-arn` to be set
- Rename `aws_oicd_role_arn` to `aws-oicd-role-arn` for consistency

…index-only scan (#9867)

## Problem

See #9866

The index-only scan prefetch implementation doesn't take into account that a downlink may be invalid.

## Summary of changes

Check that the downlink is a valid block number.

Corresponding Postgres PRs:
neondatabase/postgres#534
neondatabase/postgres#535
neondatabase/postgres#536
neondatabase/postgres#537

---------

Co-authored-by: Konstantin Knizhnik <[email protected]>
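As a rough illustration of the check, assuming Postgres' convention that `InvalidBlockNumber` is `0xFFFFFFFF`: the helper names below are hypothetical and the actual fix lives in the C code of the linked Postgres PRs.

```python
INVALID_BLOCK_NUMBER = 0xFFFFFFFF  # value of Postgres' InvalidBlockNumber


def block_number_is_valid(blkno):
    # Mirrors the idea of BlockNumberIsValid(): anything but the sentinel is valid.
    return 0 <= blkno < INVALID_BLOCK_NUMBER


def downlinks_to_prefetch(downlinks):
    # Skip invalid downlinks instead of issuing a prefetch for a bogus block.
    return [blkno for blkno in downlinks if block_number_is_valid(blkno)]
```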

## Problem

When an entry was dropped and the password wasn't set, the new entry had uninitialized memory in the controlplane adapter.

Resolves: neondatabase/cloud#14914

## Summary of changes

Initialize the password in all cases and add tests. Minor formatting for less indentation.

## Problem

In #8550, we made the flush loop wait for uploads after every layer. This was to avoid unbounded buildup of uploads, and to reduce compaction debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via `disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read amplification.

An alternative solution to these problems is proposed in #8390.

In the meantime, we revert the change to reduce the impact on ingest throughput. This does reintroduce some risk of unbounded upload/compaction buildup. Until #8390, this can be addressed in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.

## Problem

See https://neondb.slack.com/archives/C04DGM6SMTM/p1734002916827019

With the recent prefetch fixes for pg17 and `effective_io_concurrency=100`, the pg_regress test stats.sql fails when temp_buffers is set to 100, because the stream API tries to lock all 100 buffers for prefetch.

## Summary of changes

Disable this behaviour for temp relations.

Postgres PR: neondatabase/postgres#548

Co-authored-by: Konstantin Knizhnik <[email protected]>

## Problem

Changes in #9786 were functionally complete but missed some edges that made testing less robust than it should have been:

- `is_key_disposable` didn't consider SLRU dir keys disposable
- Timeline `init_empty` was always creating SLRU dir keys on all shards

The result was that when we had a bug (#10080), it wasn't apparent in tests, because one would only encounter the issue if running on a long-lived timeline with enough compaction to drop the initially created empty SLRU dir keys, _and_ some CLog truncation going on.

Closes: neondatabase/cloud#21516

## Summary of changes

- Update is_key_global and init_empty to handle SLRU dir keys properly. The only functional impact is that we avoid writing some spurious keys in shards >0, but this makes testing much more robust.
- Make `test_clog_truncate` explicitly use a sharded tenant

The net result is that if one reverts #10080, then tests fail (i.e. this PR is a reproducer for the issue).

## Problem

While reviewing #10152 I found it tricky to actually determine whether the connection used `allow_self_signed_compute` or not.

I've tried to remove this setting in the past:

* #7884
* #7437
* neondatabase/cloud#13702

But each time it turned out to be used by e2e tests.

## Summary of changes

The `node_info.allow_self_signed_computes` is always initialised to false, and then sometimes inherits the proxy config value. There's no need for this to be in the node_info, so removing it and propagating it via `TcpMechanism` is simpler.

## Problem

We want to use the safekeeper HTTP client in the storage controller and neon_local.

## Summary of changes

Extract it to a separate crate. No functional changes.

## Problem

We had a similar test in test_logical_replication, but removed it because it wasn't needed to trigger the LR-related bug. Restarting at a WAL page boundary is still a useful test, so add it back separately.

## Summary of changes

Add the test.

…ible (#10155) Remove an unnecessary `Result` and address a `FIXME`.

I noticed that the only place we use this flag is for testing console redirect proxy. Makes sense to me to make this assumption more explicit.

As the title says, I updated the lint rules to no longer allow unwrap or unimplemented.

Three special cases:

* Tests are allowed to use them
* std::sync::Mutex lock().unwrap() is common because it's usually correct to continue panicking on poison
* `tokio::spawn_blocking(...).await.unwrap()` is common because it will only error if the blocking fn panics, so continuing the panic is also correct

I've introduced two extension traits to help with these last two, that are a bit more explicit so they don't need an expect message every time.

Resolves #10159.

## Problem

To debug issues with TLS connections, there's no easy way to decrypt packets unless a client has special support for logging the keys.

## Summary of changes

Add TLS session key logging to proxy via the `SSLKEYLOGFILE` env var, gated by a flag.
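The gating idea (honour `SSLKEYLOGFILE` only when a flag is set) can be illustrated with Python's standard `ssl` module, whose `SSLContext.keylog_filename` attribute writes key material in the same key log format. This is only an analogy: the proxy is written in Rust, and the `enable_keylog` parameter is a hypothetical stand-in for its flag.

```python
import os
import ssl


def make_client_context(enable_keylog):
    """Build a TLS client context; log session keys only when explicitly enabled."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    keylog_path = os.environ.get("SSLKEYLOGFILE")
    # Both conditions are required: the flag must be on AND the env var must be set.
    if enable_keylog and keylog_path:
        ctx.keylog_filename = keylog_path
    return ctx
```

Requiring both the flag and the environment variable keeps key logging from being switched on by an environment variable alone.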

## Problem

It's impossible to run docker compose with compute v17 due to the `pg_anon` extension, which is not supported under PG17.

## Summary of changes

The auto-loading of `pg_anon` is disabled by default.

…0158) Touches #10097.

## Problem

The ABS SDK's default behavior is to do no connection pooling, i.e. open and close a fresh connection for each request. Under high request rates, this can result in an accumulation of TCP connections in TIME_WAIT or CLOSE_WAIT state, and in extreme cases exhaustion of client ports.

Related: neondatabase/cloud#20971

## Summary of changes

- Add a configurable `conn_pool_size` parameter for Azure storage, defaulting to zero (current behavior)
- Construct a custom reqwest client using this connection pool size.

…er (#10125)

## Problem

It was reported as `gauge`, but it's actually a `counter`. Also add `_total` suffix as that's the convention for counters.

The corresponding flux-fleet PR: neondatabase/flux-fleet#386
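To make the gauge/counter distinction concrete, here is a minimal sketch of counter semantics. The `CounterSketch` class and `rate` helper are illustrative only, and the full metric name is inferred by appending `_total` to the name in the commit title, so treat it as an assumption.

```python
class CounterSketch:
    """A counter only ever goes up; consumers derive rates from successive samples."""

    def __init__(self, name):
        if not name.endswith("_total"):
            raise ValueError("counter names conventionally end in _total")
        self.name = name
        self.value = 0.0

    def inc(self, amount):
        if amount < 0:
            raise ValueError("counters cannot decrease; use a gauge for that")
        self.value += amount


def rate(previous_sample, current_sample, interval_seconds):
    # Per-second increase between two scrapes of a counter.
    return (current_sample - previous_sample) / interval_seconds
```

For example, a counter named `compute_backpressure_throttling_seconds_total` that moves from 2.0 to 5.0 over a 6 second interval corresponds to 0.5 seconds of throttling per second.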

Don't build tests in h3 and rdkit: ~15 min speedup.

Use Ninja as cmake generator where possible: ~10 min speedup.

Clean apt cache for smaller images: around 250 MB size reduction for intermediate layers.

## Problem

Jemalloc heap profiles aren't symbolized. This is inconvenient, and doesn't work with Grafana Cloud Profiles.

Resolves #9964.

## Summary of changes

Symbolize the heap profiles in-process, and strip unnecessary cruft.

This uses about 100 MB additional memory to cache the DWARF information, but I believe this is already the case with CPU profiles, which use the same library for symbolization. With cached DWARF information, the symbolization CPU overhead is negligible.

Example profiles:

* [pageserver.pb.gz](https://github.com/user-attachments/files/18141395/pageserver.pb.gz)
* [safekeeper.pb.gz](https://github.com/user-attachments/files/18141396/safekeeper.pb.gz)

Our solutions engineers and some customers would like to have this extension available. Link: neondatabase/cloud#18890 Signed-off-by: Tristan Partin <[email protected]>

## Problem

Frame pointers are typically disabled by default (depending on CPU architecture), to improve performance. This frees up a CPU register, and avoids a couple of instructions per function call. However, it makes stack unwinding much more inefficient, since it has to use DWARF debug information instead, and gives worse results with e.g. `perf` and eBPF profiles. The `backtrace` implementation of `libunwind` is also suspected to cause seg faults.

The performance benefit of frame pointer omission doesn't appear to matter that much on modern 64-bit CPU architectures (which have plenty of registers and optimized instruction execution), and benchmarks did not show measurable overhead.

The Rust standard library and jemalloc already enable frame pointers by default.

For more information, see https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html.

Resolves #10224.
Touches #10225.

## Summary of changes

Enable frame pointers in all builds, and use frame pointers for pprof-rs stack sampling.

…time (#10193)

## Problem

close #10192

## Summary of changes

* `find_gc_time_cutoff` takes `now` parameter so that all branches compute the cutoff based on the same start time, avoiding races.
* gc-compaction uses a single `get_gc_compaction_watermark` function to get the safe LSN to compact.

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Arpad Müller <[email protected]>
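The first bullet can be sketched as below. Only the function name `find_gc_time_cutoff` and its `now` parameter come from the description; the signature, the `pitr` argument and the `cutoffs_for_branches` helper are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_gc_time_cutoff(now, pitr):
    # The cutoff depends only on the caller-supplied timestamp, never on the clock.
    return now - pitr


def cutoffs_for_branches(pitr_by_branch, now=None):
    # Read the clock once, then reuse the same instant for every branch.
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        branch: find_gc_time_cutoff(now, pitr)
        for branch, pitr in pitr_by_branch.items()
    }
```

Because the clock is read a single time, two branches with the same retention window always get identical cutoffs, however long the loop takes.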

See #10226 (comment).

## Problem

It's impossible to run regression tests with Python 3.13 as some dependencies don't support it (some of them are outdated, and `jsonnet` doesn't support it at all yet).

## Summary of changes

- Update dependencies for Python 3.13
- Install `jsonnet` only on Python < 3.13 and skip relevant tests on Python 3.13

Closes #10237
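One way to express "skip relevant tests on Python 3.13" is sketched below using the standard library's `unittest`. The names `JSONNET_SUPPORTED`, `load_jsonnet` and `JsonnetTests` are hypothetical and not taken from the repository; `_jsonnet` is the module name exposed by the `jsonnet` Python package.

```python
import sys
import unittest

# Per the description above, jsonnet is only installed on Python < 3.13.
JSONNET_SUPPORTED = sys.version_info < (3, 13)


def load_jsonnet():
    """Return the jsonnet module, or None when it is unsupported or not installed."""
    if not JSONNET_SUPPORTED:
        return None
    try:
        import _jsonnet
    except ImportError:
        return None
    return _jsonnet


class JsonnetTests(unittest.TestCase):
    @unittest.skipIf(load_jsonnet() is None, "jsonnet is not available on this interpreter")
    def test_evaluate_snippet(self):
        result = load_jsonnet().evaluate_snippet("snippet", "1 + 1")
        self.assertEqual(result.strip(), "2")
```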

## Problem

On macOS:

```
error: unused variable: `disable_lfc_resizing`
   --> compute_tools/src/bin/compute_ctl.rs:431:9
    |
431 |         disable_lfc_resizing,
    |         ^^^^^^^^^^^^^^^^^^^^ help: try ignoring the field: `disable_lfc_resizing: _`
    |
    = note: `-D unused-variables` implied by `-D warnings`
    = help: to override `-D warnings` add `#[allow(unused_variables)]`
```

## Summary of changes

- Initialise `disable_lfc_resizing` only on Linux (because it's only used on Linux in a later block)

…10274) Using `min(0, ...)` causes us to fail to wait in most situations, so a lack of data would be a hot wait loop, which is bad.

## Problem

We noticed high CPU usage in some situations.
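The difference between the two clamps is easy to see in a small sketch. The function names and the millisecond units are assumptions; the actual change is in the C code of libpagestore.

```python
def wait_timeout_ms(deadline_ms, now_ms):
    # Correct clamp: wait for the remaining time, but never a negative amount.
    return max(0, deadline_ms - now_ms)


def buggy_wait_timeout_ms(deadline_ms, now_ms):
    # With min(), any deadline still in the future collapses to a zero timeout,
    # so the caller polls again immediately and spins.
    return min(0, deadline_ms - now_ms)
```

With a deadline 500 ms away, the first function returns 500 while the second returns 0, which is the hot wait loop described above.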

…10267)

## Problem

`trigger-e2e-tests` waits half an hour before starting to run. Nearly half of that time can be saved by promoting images before tests on them are complete, so the e2e tests can run in parallel.

On `main` and `release{,-proxy,-compute}`, `promote-images` updates `latest` and pushes things to prod ecr, so we want to run `promote-images` only after `test-images` is done, but on other branches, there is no harm in promoting images that aren't tested yet.

## Summary of changes

To promote images into dev container registries sooner, `promote-images` is split into `promote-images-dev` and `promote-images-prod`. The former pushes to dev container registries, the latter to prod ones. The latter also waits for `test-images`, while the former doesn't. This allows `trigger-e2e-tests` to run sooner.

Apparently, we failed to do this bookkeeping in quite a few places...

## Problem

Fixes neondatabase/cloud#22364

## Summary of changes

Add accounting of dropped requests.

Note that this includes prefetches dropped due to things like "PS connection dropped unexpectedly" or "prefetch queue is already full", but *not* (yet?) "dropped due to backend shutdown".
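A minimal sketch of this kind of bookkeeping is shown below. The class, enum and method names are hypothetical; only the two drop reasons are taken from the note above.

```python
from collections import Counter
from enum import Enum


class DropReason(Enum):
    CONNECTION_LOST = "PS connection dropped unexpectedly"
    QUEUE_FULL = "prefetch queue is already full"


class PrefetchAccounting:
    """Every issued request must end up either completed or dropped."""

    def __init__(self):
        self.issued = 0
        self.completed = 0
        self.dropped = Counter()  # DropReason -> number of dropped requests

    def on_issue(self):
        self.issued += 1

    def on_complete(self):
        self.completed += 1

    def on_drop(self, reason):
        self.dropped[reason] += 1

    def in_flight(self):
        # Without counting drops, this number would drift upward forever.
        return self.issued - self.completed - sum(self.dropped.values())
```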

github-actions · 2025-01-07T11:46:14Z

7253 tests run: 6900 passed, 0 failed, 353 skipped (full report)

Flaky tests (2)

Postgres 15

test_parallel_copy: release-arm64

Postgres 14

test_timeline_copy[100000]: release-arm64

Code coverage* (full report)

functions: 31.2% (8409 of 26962 functions)
lines: 48.0% (66777 of 139234 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 31bd2dc at 2025-01-07T19:50:24.789Z

## Problem

`promote-images` was split into `promote-images-dev` and `promote-images-prod` in #10267. `dev` credentials were loaded in `promote-images-dev` and `prod` credentials were loaded in `promote-images-prod`, but `promote-images-prod` needs `dev` credentials as well to access the `dev` images to replicate them from `dev` to `prod`.

## Summary of changes

Load `dev` credentials in `promote-images-prod` as well.

rahulinux and others added 30 commits December 13, 2024 12:22

fix(ci): Allow github-action-script to post reports (#10136)

2451969

Allow github-action-script to post reports. Failed CI: https://github.com/neondatabase/neon/actions/runs/12304655364/job/34342554049#step:13:514

Fix pg_regress tests on a cloud staging instance (#10134)

7dc3826

## Problem

pg_regress tests start failing due to unique ids added to Neon error messages.

## Summary of changes

Patches updated.

test_prefetch: reduce timeout to default 5m from 10m (#10105)

2c91062

## Problem

`test_prefetch` is flaky (#9961), but when it passes, the run time is less than 30 seconds, so we don't need an extended timeout for it.

## Summary of changes

- Remove extended test timeout for `test_prefetch`

Extract safekeeper http client to separate crate. (#10140)

1ed0e52

## Problem

We want to use the safekeeper HTTP client in the storage controller and neon_local.

## Summary of changes

Extract it to a separate crate. No functional changes.

chore: fix clippy lints 2024-12-06 (#10138)

6565fd4

pageserver: make RemoteTimelineClient::schedule_index_upload infall…

3d30a7a

…ible (#10155) Remove an unnecessary `Result` and address a `FIXME`.

chore(proxy): remove allow_self_signed from regular proxy (#10157)

2e4c9c5

I noticed that the only place we use this flag is for testing console redirect proxy. Makes sense to me to make this assumption more explicit.

test_runner: ignore error in test_timeline_archival_chaos (#10161)

28ccda0

Resolves #10159.

Fix docker compose with PG17 (#10165)

e226d7a

## Problem

It's impossible to run docker compose with compute v17 due to the `pg_anon` extension, which is not supported under PG17.

## Summary of changes

The auto-loading of `pg_anon` is disabled by default.

postgres_ffi: add WalStreamDecoder::complete_record() benchmark (#1…

b0e43c2

…0158) Touches #10097.

fix(compute): Report compute_backpressure_throttling_seconds as count…

2dfd3ca

…er (#10125)

## Problem

It was reported as `gauge`, but it's actually a `counter`. Also add `_total` suffix as that's the convention for counters.

The corresponding flux-fleet PR: neondatabase/flux-fleet#386

Don't build tests in compute image, use ninja (#10149)

007b13b

Don't build tests in h3 and rdkit: ~15 min speedup.

Use Ninja as cmake generator where possible: ~10 min speedup.

Clean apt cache for smaller images: around 250 MB size reduction for intermediate layers.

Add pg_repack extension (#10100)

7dddbb9

Our solutions engineers and some customers would like to have this extension available. Link: neondatabase/cloud#18890 Signed-off-by: Tristan Partin <[email protected]>

erikgrinaker and others added 9 commits January 6, 2025 17:27

Dockerfile: build with force-frame-pointers=yes (#10286)

b342a02

See #10226 (comment).

libpagestore: timeout = max(0, difference), not min(0, difference) (#…

30863c0

…10274) Using `min(0, ...)` causes us to fail to wait in most situations, so a lack of data would be a hot wait loop, which is bad.

## Problem

We noticed high CPU usage in some situations.

Compute release 2025-01-07

6292d93

vipvap requested review from a team as code owners January 7, 2025 10:48

vipvap requested review from myrrc, lubennikovaav, VladLazar, clipperhouse, mtyazici and conradludgate and removed request for a team January 7, 2025 10:48

hlinnaka requested a review from ololobus January 7, 2025 10:52

ololobus approved these changes Jan 7, 2025

hlinnaka merged commit 441517d into release-compute Jan 8, 2025
88 checks passed

hlinnaka deleted the rc/release-compute/2025-01-07 branch January 8, 2025 12:10
