
Disable logical replication subscribers #10249

Open

lubennikovaav wants to merge 10 commits into main from the disable_logical_replication_subscribers branch

Conversation

lubennikovaav (Contributor)

Problem

#8790

Summary of changes

This PR adds a new GUC disable_logical_replication_subscribers that prevents subscribers from starting on a new branch.

TODO:

  • test code cleanup
  • add cloud code to set this GUC for computes that start on a non-main branch
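
For background, what "dropping subscriptions before start" amounts to in plain PostgreSQL is the standard three-step sequence below, which removes a subscription without touching the replication slot on the publisher (on a branch, that slot still belongs to the parent's compute). This is only a sketch of the mechanics; sub1 is a hypothetical name, and the PR's actual SQL lives under compute_tools/src/sql/.

-- sub1 is a hypothetical subscription name; illustration only.
ALTER SUBSCRIPTION sub1 DISABLE;
-- Dissociate the slot so DROP does not try to remove it on the publisher.
ALTER SUBSCRIPTION sub1 SET (slot_name = NONE);
DROP SUBSCRIPTION sub1;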


github-actions bot commented Dec 30, 2024

7370 tests run: 6988 passed, 0 failed, 382 skipped (full report)


Flaky tests (4): Postgres 17, Postgres 16, Postgres 15

Code coverage* (full report)

  • functions: 33.6% (8443 of 25113 functions)
  • lines: 49.1% (70780 of 144041 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last update: 229e674 at 2025-01-21T15:17:51.847Z.

# This is not desired behavior, because if sub were running, it would have created a race condition
# with sub and sub_child_2 both moving the same replication slot forward.
#
# TODO: is there any way to test this race without flakiness?
Member

I wonder, what if you do not shut down the original subscriber and just start another one from the branch, then:

  1. Insert 1k records on the publisher and get LSN1.
  2. Wait until the slot is advanced on the publisher to >= LSN1 (probably also wait for both subs to catch up?).
  3. Check the state of both subs. My understanding is that they both should be caught up, i.e. their LSN is >= LSN1 (we can wait for this and assert on it), but one of the subs won't have all the records (assert on this).

I'm totally not sure about 3., so I'd just follow these steps and inspect the state of both subs. If my understanding is correct, then waiting for both subs to catch up plus asserting on N rows should work.

cc @arssher: Anastasia is trying to write a predictable test for the case of two subs/branches trying to replicate from the same publisher/slot. Maybe you have some thoughts/suggestions.
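
As a rough SQL sketch of those checks (the table t, the subscription/slot name mysub, and the literal LSN are placeholders; this illustrates the idea, not the actual test code):

-- On the publisher: insert 1k records and capture LSN1.
INSERT INTO t SELECT generate_series(1, 1000);
SELECT pg_current_wal_flush_lsn() AS lsn1;

-- Poll on the publisher until the slot has advanced to >= LSN1
-- ('0/0' stands in for the LSN1 captured above).
SELECT confirmed_flush_lsn >= '0/0'::pg_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'mysub';

-- On each subscriber: check apply progress and the row count.
SELECT received_lsn, latest_end_lsn FROM pg_stat_subscription WHERE subname = 'mysub';
SELECT count(*) FROM t;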

vendor/postgres-v17 (resolved review thread, outdated)
@lubennikovaav force-pushed the disable_logical_replication_subscribers branch from c3d0c87 to be663bd on January 9, 2025 09:45
@lubennikovaav marked this pull request as ready for review on January 9, 2025 09:46
@lubennikovaav requested review from a team as code owners on January 9, 2025 09:46
@skyzh (Member) left a comment

LGTM for the storage part of the patch

compute_tools/src/spec_apply.rs (resolved review thread)
@@ -444,7 +444,7 @@ async fn get_operations<'a>(
         }
         ApplySpecPhase::RunInEachDatabase { db, subphase } => {
             match subphase {
-                PerDatabasePhase::DropSubscriptionsForDeletedDatabases => {
+                PerDatabasePhase::DropLogicalSubscriptions => {
Member

The SQL file should probably be renamed as well: drop_subscription_for_drop_dbs.sql -> drop_subscriptions.sql?

compute_tools/src/compute.rs (resolved review thread)
res = scur.fetchall()
assert res[0][0] == n_records

# wake the sub and ensure that it catches up with the new data
Member

Hmm, interesting. So, at this point, you have two subscribers using the same publisher/slot, and both are able to catch up?

@lubennikovaav (Contributor, Author) commented on Jan 9, 2025

They are totally independent and use different publications and slots.

The old sub (on the parent branch) catches up using the initial sub and pub, and the child branch uses sub_new and pub_new.

Here I just wanted to test that sub is not affected by operations that we did on a child branch.

@ololobus removed the request for review from kelvich on January 20, 2025 20:45
@ololobus (Member) left a comment

Left some comments, but haven't had a closer look at the test, will do that tomorrow. Also, the PR description is very outdated.

// it will start again with the same spec and flag value.
//
// To handle this, we save the fact of the operation in the database
// in the neon_migration_test.drop_subscriptions_done table.
Member

  1. I think it should be:

     Suggested change:
     - // in the neon_migration_test.drop_subscriptions_done table.
     + // in the neon_migration.drop_subscriptions_done table.

  2. Should we create a new schema neon (or use it if it already exists)? I ask because this operation doesn't sound like a normal migration, so reusing neon_migration is a bit misleading. Overall, I think we should probably use the same neon schema for all internal metadata inside the postgres DB.
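
A minimal sketch of that suggestion (the dedicated neon schema is the reviewer's proposal, not what the PR currently implements; the table shape mirrors the one the PR creates under neon_migration):

CREATE SCHEMA IF NOT EXISTS neon;
-- Keep compute-internal bookkeeping in one schema instead of reusing neon_migration.
CREATE TABLE IF NOT EXISTS neon.drop_subscriptions_done (timeline_id text);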

compute_tools/src/compute.rs (two resolved review threads, outdated)
let query = format!("select 1 from neon_migration.drop_subscriptions_done where timeline_id = '{}'", timeline_id);

info!("Checking if drop_subscriptions_done was already performed for timeline_id: {}", timeline_id);
info!("Query: {}", query);
Member

NIT: should it be a debug?

];

if spec.drop_subscriptions_before_start && !drop_subscriptions_done {
info!("Adding FinalizeDropLogicalSubscriptions phase because drop_subscriptions_before_start is set");
Member

NIT: can you please make the logging consistent, i.e. start everything with lowercase?

Contributor (Author)

Hm, I've tried to keep it consistent with the surrounding code.
It looks like we don't have one style in this file.

compute_tools/src/compute.rs (resolved review thread, outdated)
}
// This is a hack to enable testing of the drop_subscriptions_before_start feature.
// Why don't we just always set the database vector properly here
// so that the code matches real cplane logic?
Member

For real-life logic, it should be some non-superuser role and a non-postgres DB, so it's more like the case of create_test_user=true, and I'd do this, yes; let's convert this to a TODO?

But I don't really understand why you need the postgres DB here? It's already present and its owner is cloud_admin. It's always like that because they are created by initdb.

Comment on lines 18 to 19
INSERT INTO neon_migration.drop_subscriptions_done
VALUES (current_setting('neon.timeline_id'));
Member

Just inserting will bloat this table if a user creates a branch of a branch or restores, because each time we will only append a new timeline here.

I think we can safely clean the table each time, but in a transaction block (which is probably ensured by the procedure block, but not sure), like:

BEGIN;
DELETE FROM neon_migration.drop_subscriptions_done;
INSERT INTO neon_migration.drop_subscriptions_done
    VALUES (current_setting('neon.timeline_id'));
END;

That way, we only have the latest timeline in this table, which is what we actually need, and its size will likely be limited to a single page.

compute_tools/src/sql/finalize_drop_subscriptions.sql (resolved review thread, outdated)
REVOKE ALL ON SCHEMA neon_migration FROM PUBLIC;

CREATE TABLE neon_migration.drop_subscriptions_done
(timeline_id text);
Member

Not sure where exactly, but I'd mention in the comments why we store the timeline_id here, i.e. that during time-travel and branch resets we may need to do the drop again, so we cannot rely on just a boolean flag as it might already be set.

Contributor (Author)

Added the comment here, but there is also a big comment in apply_spec_sql.

Member

> but there is also a big comment in apply_spec_sql

Yeah, that was good, thanks; it just didn't mention why we need to store the timeline id.

lubennikovaav and others added 8 commits on January 21, 2025 09:41. The commit messages (truncated in this view) describe the change:

disable_logical_replication_subscribers … with logical subscriptions. If it is set, drop all the subscriptions from the compute node before it starts. To avoid a race on compute start, use the new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions … when the endpoint starts on a new branch.

It is essential, because otherwise we may drop not only inherited, but also newly created subscriptions.

We cannot rely only on the spec.drop_subscriptions_before_start flag, because if for some reason the compute restarts inside the VM, it will start again with the same spec and flag value.

To handle this, we save the fact of the operation in the database, in the neon_migration_test.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If the table exists, we check whether the operation was performed on the current timeline.

Also adjust the test test_subscriber_branching.py to cover this case.
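
Putting that description together, the per-timeline bookkeeping roughly amounts to the two statements below (a sketch only; the schema name follows the snippets quoted earlier in the review, and the real logic lives in compute_tools/src/compute.rs):

-- Was the drop already performed for this timeline?
SELECT 1
  FROM neon_migration.drop_subscriptions_done
 WHERE timeline_id = current_setting('neon.timeline_id');

-- If not: drop the inherited subscriptions, then record the fact.
INSERT INTO neon_migration.drop_subscriptions_done
     VALUES (current_setting('neon.timeline_id'));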
@lubennikovaav force-pushed the disable_logical_replication_subscribers branch from 8e200c1 to 0d63443 on January 21, 2025 10:28