
Disable logical replication subscribers #10249

Open

lubennikovaav wants to merge 10 commits into main from the disable_logical_replication_subscribers branch

Conversation

lubennikovaav (Contributor)

Problem

#8790

Summary of changes

This PR adds a new GUC disable_logical_replication_subscribers that prevents subscribers from starting on a new branch.

TODO:

  • test code cleanup
  • add cloud code to set this GUC for computes that start on a non-main branch
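
For background, what "dropping subscriptions before start" amounts to in plain PostgreSQL is the standard three-step sequence below, which removes a subscription without touching the replication slot on the publisher (on a branch, that slot still belongs to the parent's compute). This is only a sketch of the mechanics; sub1 is a hypothetical name, and the PR's actual SQL lives under compute_tools/src/sql/.

-- sub1 is a hypothetical subscription name; illustration only.
ALTER SUBSCRIPTION sub1 DISABLE;
-- Dissociate the slot so DROP does not try to remove it on the publisher.
ALTER SUBSCRIPTION sub1 SET (slot_name = NONE);
DROP SUBSCRIPTION sub1;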


github-actions bot commented Dec 30, 2024

7370 tests run: 6988 passed, 0 failed, 382 skipped (full report)


Flaky tests (4): Postgres 17, Postgres 16, Postgres 15

Code coverage* (full report)

  • functions: 33.6% (8443 of 25113 functions)
  • lines: 49.1% (70780 of 144041 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last update: 229e674 at 2025-01-21T15:17:51.847Z.

# This is not desired behavior, because if sub were running, it would have created a race condition
# with sub and sub_child_2 both moving the same replication slot forward.
#
# TODO: is there any way to test this race without flakiness?
Member

I wonder, what if you do not shut down the original subscriber and just start another one from the branch, then:

  1. Insert 1k records on the publisher and get LSN1.
  2. Wait until the slot is advanced on the publisher to >= LSN1 (probably also wait for both subs to catch up?).
  3. Check the state of both subs. My understanding is that they both should be caught up, i.e. their LSN is >= LSN1 (we can wait for this and assert on it), but one of the subs won't have all the records (assert on this).

I'm totally not sure about 3., so I'd just follow these steps and inspect the state of both subs. If my understanding is correct, then waiting for both subs to catch up plus asserting on N rows should work.

cc @arssher: Anastasia is trying to write a predictable test for the case of two subs/branches trying to replicate from the same publisher/slot. Maybe you have some thoughts/suggestions.
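
As a rough SQL sketch of those checks (the table t, the subscription/slot name mysub, and the literal LSN are placeholders; this illustrates the idea, not the actual test code):

-- On the publisher: insert 1k records and capture LSN1.
INSERT INTO t SELECT generate_series(1, 1000);
SELECT pg_current_wal_flush_lsn() AS lsn1;

-- Poll on the publisher until the slot has advanced to >= LSN1
-- ('0/0' stands in for the LSN1 captured above).
SELECT confirmed_flush_lsn >= '0/0'::pg_lsn
  FROM pg_replication_slots
 WHERE slot_name = 'mysub';

-- On each subscriber: check apply progress and the row count.
SELECT received_lsn, latest_end_lsn FROM pg_stat_subscription WHERE subname = 'mysub';
SELECT count(*) FROM t;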

vendor/postgres-v17 (resolved review thread, outdated)
@lubennikovaav force-pushed the disable_logical_replication_subscribers branch from c3d0c87 to be663bd on January 9, 2025 09:45
@lubennikovaav marked this pull request as ready for review on January 9, 2025 09:46
@lubennikovaav requested review from a team as code owners on January 9, 2025 09:46
@skyzh (Member) left a comment

LGTM for the storage part of the patch

compute_tools/src/spec_apply.rs (resolved review thread)
@@ -444,7 +444,7 @@ async fn get_operations<'a>(
         }
         ApplySpecPhase::RunInEachDatabase { db, subphase } => {
             match subphase {
-                PerDatabasePhase::DropSubscriptionsForDeletedDatabases => {
+                PerDatabasePhase::DropLogicalSubscriptions => {
Member

The SQL file should probably be renamed as well: drop_subscription_for_drop_dbs.sql -> drop_subscriptions.sql?

compute_tools/src/compute.rs (resolved review thread)
res = scur.fetchall()
assert res[0][0] == n_records

# wake the sub and ensure that it catches up with the new data
Member

Hmm, interesting. So, at this point, you have two subscribers using the same publisher/slot, and both are able to catch up?

@lubennikovaav (Contributor, Author) commented on Jan 9, 2025

They are totally independent and use different publications and slots.

The old sub (on the parent branch) catches up using the initial sub and pub, and the child branch uses sub_new and pub_new.

Here I just wanted to test that sub is not affected by operations that we did on a child branch.

@ololobus removed the request for review from kelvich on January 20, 2025 20:45
@ololobus (Member) left a comment

Left some comments, but haven't had a closer look at the test, will do that tomorrow. Also, the PR description is very outdated.

// it will start again with the same spec and flag value.
//
// To handle this, we save the fact of the operation in the database
// in the neon_migration_test.drop_subscriptions_done table.
Member

  1. I think it should be:

     Suggested change:
     - // in the neon_migration_test.drop_subscriptions_done table.
     + // in the neon_migration.drop_subscriptions_done table.

  2. Should we create a new schema neon (or use it if it already exists)? I ask because this operation doesn't sound like a normal migration, so reusing neon_migration is a bit misleading. Overall, I think we should probably use the same neon schema for all internal metadata inside the postgres DB.
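
A minimal sketch of that suggestion (the dedicated neon schema is the reviewer's proposal, not what the PR currently implements; the table shape mirrors the one the PR creates under neon_migration):

CREATE SCHEMA IF NOT EXISTS neon;
-- Keep compute-internal bookkeeping in one schema instead of reusing neon_migration.
CREATE TABLE IF NOT EXISTS neon.drop_subscriptions_done (timeline_id text);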

compute_tools/src/compute.rs (two resolved review threads, outdated)
let query = format!("select 1 from neon_migration.drop_subscriptions_done where timeline_id = '{}'", timeline_id);

info!("Checking if drop_subscriptions_done was already performed for timeline_id: {}", timeline_id);
info!("Query: {}", query);
Member

NIT: should it be a debug?

];

if spec.drop_subscriptions_before_start && !drop_subscriptions_done {
info!("Adding FinalizeDropLogicalSubscriptions phase because drop_subscriptions_before_start is set");
Member

NIT: can you please make the logging consistent, i.e. start everything with lowercase?

Contributor (Author)

Hm, I've tried to keep it consistent with the surrounding code.
It looks like we don't have one style in this file.

compute_tools/src/compute.rs (resolved review thread, outdated)
}
// This is a hack to enable testing of the drop_subscriptions_before_start feature.
// Why don't we just always set the database vector properly here
// so that the code matches real cplane logic?
Member

For real-life logic, it should be some non-superuser role and a non-postgres DB, so it's more like the case of create_test_user=true, and I'd do this, yes; let's convert this to a TODO?

But I don't really understand why you need the postgres DB here? It's already present and its owner is cloud_admin. It's always like that because they are created by initdb.

Comment on lines 18 to 19
INSERT INTO neon_migration.drop_subscriptions_done
VALUES (current_setting('neon.timeline_id'));
Member

Just inserting will bloat this table if a user creates a branch of a branch or restores, because each time we will only append a new timeline here.

I think we can safely clean the table each time, but in a transaction block (which is probably ensured by the procedure block, but not sure), like:

BEGIN;
DELETE FROM neon_migration.drop_subscriptions_done;
INSERT INTO neon_migration.drop_subscriptions_done
    VALUES (current_setting('neon.timeline_id'));
END;

That way, we only have the latest timeline in this table, which is what we actually need, and its size will likely be limited to a single page.

compute_tools/src/sql/finalize_drop_subscriptions.sql (resolved review thread, outdated)
REVOKE ALL ON SCHEMA neon_migration FROM PUBLIC;

CREATE TABLE neon_migration.drop_subscriptions_done
(timeline_id text);
Member

Not sure where exactly, but I'd mention in the comments why we store the timeline_id here, i.e. that during time-travel and branch resets we may need to do the drop again, so we cannot rely on just a boolean flag as it might already be set.

Contributor (Author)

Added the comment here, but there is also a big comment in apply_spec_sql.

Member

> but there is also a big comment in apply_spec_sql

Yeah, that was good, thanks; it just didn't mention why we need to store the timeline id.

lubennikovaav and others added 8 commits on January 21, 2025 09:41. The commit messages (truncated in this view) describe the change:

disable_logical_replication_subscribers … with logical subscriptions. If it is set, drop all the subscriptions from the compute node before it starts. To avoid a race on compute start, use the new GUC neon.disable_logical_replication_subscribers to temporarily disable logical replication workers until we drop the subscriptions … when the endpoint starts on a new branch.

It is essential, because otherwise we may drop not only inherited, but also newly created subscriptions.

We cannot rely only on the spec.drop_subscriptions_before_start flag, because if for some reason the compute restarts inside the VM, it will start again with the same spec and flag value.

To handle this, we save the fact of the operation in the database, in the neon_migration_test.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If the table exists, we check whether the operation was performed on the current timeline.

Also adjust the test test_subscriber_branching.py to cover this case.
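
Putting that description together, the per-timeline bookkeeping roughly amounts to the two statements below (a sketch only; the schema name follows the snippets quoted earlier in the review, and the real logic lives in compute_tools/src/compute.rs):

-- Was the drop already performed for this timeline?
SELECT 1
  FROM neon_migration.drop_subscriptions_done
 WHERE timeline_id = current_setting('neon.timeline_id');

-- If not: drop the inherited subscriptions, then record the fact.
INSERT INTO neon_migration.drop_subscriptions_done
     VALUES (current_setting('neon.timeline_id'));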
@lubennikovaav force-pushed the disable_logical_replication_subscribers branch from 8e200c1 to 0d63443 on January 21, 2025 10:28