Disable logical replication subscribers #10249
base: main
Conversation
7370 tests run: 6988 passed, 0 failed, 382 skipped (full report)
Flaky tests (4): Postgres 17, Postgres 16, Postgres 15
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
229e674 at 2025-01-21T15:17:51.847Z :recycle:
# This is not desired behavior, because if sub were running, it would have created a race condition
# with sub and sub_child_2 both moving the same replication slot forward.
#
# TODO: is there any way to test this race without flakiness?
I wonder, what if you do not shut down the original subscriber and just start another one from the branch, then:
1. Insert 1k records on the publisher, and get LSN1.
2. Wait until the slot on the publisher is advanced to >= LSN1 (probably also wait for both subs to catch up?).
3. Check the state of both subs. My understanding is that they both should be caught up, i.e. their LSN is >= LSN1 (we can wait for this and assert on it), but one of the subs won't have all the records (assert on this).
I'm totally not sure about 3., so I'd just follow these steps and inspect the state of both subs. If my understanding is correct, then waiting for both subs to catch up + asserting on N rows should work.
cc @arssher, Anastasia is trying to write a predictable test for the case of two subs/branches trying to replicate from the same publisher/slot. Maybe you have some thoughts/suggestions.
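A rough sketch of how these steps could look, assuming plain psycopg2 connections and a single-column table `t` (the DSNs, the table, and the slot name `sub` are placeholders, not the project's actual fixtures):

```python
import time
import psycopg2

# Naive polling helper; the repo has its own wait utilities, this is just for the sketch.
def wait_until(cond, timeout=60.0, interval=0.5):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if cond():
            return
        time.sleep(interval)
    raise TimeoutError("condition not met")

# Assumed DSNs; the real test uses the project's fixtures.
pub = psycopg2.connect("dbname=publisher")
sub1 = psycopg2.connect("dbname=subscriber_parent")
sub2 = psycopg2.connect("dbname=subscriber_child")
for conn in (pub, sub1, sub2):
    conn.autocommit = True

# 1. Insert 1k records on the publisher and remember the resulting LSN.
with pub.cursor() as cur:
    cur.execute("INSERT INTO t SELECT generate_series(1, 1000)")
    cur.execute("SELECT pg_current_wal_flush_lsn()")
    lsn1 = cur.fetchone()[0]

# 2. Wait until the slot on the publisher has advanced to >= LSN1.
def slot_advanced():
    with pub.cursor() as cur:
        cur.execute(
            "SELECT confirmed_flush_lsn >= %s::pg_lsn FROM pg_replication_slots"
            " WHERE slot_name = 'sub'",
            (lsn1,),
        )
        row = cur.fetchone()
        return row is not None and row[0]

wait_until(slot_advanced)

# 3. Inspect both subs: the expectation to verify is that the slot has moved past LSN1,
#    yet only one of the subscribers actually received all 1000 rows.
for conn in (sub1, sub2):
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM t")
        print(conn.dsn, cur.fetchone()[0])
```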
Force-pushed from c3d0c87 to be663bd (compare)
LGTM for the storage part of the patch
@@ -444,7 +444,7 @@ async fn get_operations<'a>(
         }
         ApplySpecPhase::RunInEachDatabase { db, subphase } => {
             match subphase {
-                PerDatabasePhase::DropSubscriptionsForDeletedDatabases => {
+                PerDatabasePhase::DropLogicalSubscriptions => {
The SQL file should probably be renamed as well: drop_subscription_for_drop_dbs.sql -> drop_subscriptions.sql?
res = scur.fetchall()
assert res[0][0] == n_records

# wake the sub and ensure that it catches up with the new data
Hmm, interesting. So, at this point, you have two subscribers using the same publisher/slot, and both are able to catch up?
They are totally independent and use different publications and slots.
The old sub (on the parent branch) catches up using the initial `sub` and `pub`, and the child branch uses `sub_new` and `pub_new`.
Here I just wanted to test that `sub` is not affected by operations that we did on the child branch.
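For illustration, a minimal sketch of that layout with plain psycopg2 and standard Postgres DDL (the DSNs and the table `t` are placeholders): each branch owns its own publication/subscription pair and therefore its own replication slot on the publisher.

```python
import psycopg2

# Assumed DSNs; in the real test these come from the branch/publisher fixtures.
publisher = psycopg2.connect("dbname=publisher")
parent_branch = psycopg2.connect("dbname=parent_branch")
child_branch = psycopg2.connect("dbname=child_branch")
for conn in (publisher, parent_branch, child_branch):
    conn.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block

with publisher.cursor() as cur:
    # Two independent publications; each subscription below creates its own slot.
    cur.execute("CREATE PUBLICATION pub FOR TABLE t")
    cur.execute("CREATE PUBLICATION pub_new FOR TABLE t")

with parent_branch.cursor() as cur:
    # The parent branch keeps replicating through the original pair (sub -> pub).
    cur.execute(
        "CREATE SUBSCRIPTION sub CONNECTION 'dbname=publisher' PUBLICATION pub"
    )

with child_branch.cursor() as cur:
    # The child branch gets its own pair (sub_new -> pub_new), so dropping or
    # pausing it cannot advance the slot that belongs to sub.
    cur.execute(
        "CREATE SUBSCRIPTION sub_new CONNECTION 'dbname=publisher' PUBLICATION pub_new"
    )
```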
Left some comments, but haven't had a closer look at the test, will do it tomorrow. Also, the PR description is very outdated.
compute_tools/src/compute.rs (Outdated)
// it will start again with the same spec and flag value.
//
// To handle this, we save the fact of the operation in the database
// in the neon_migration_test.drop_subscriptions_done table.
- I think it should be:
  `// in the neon_migration.drop_subscriptions_done table.` (instead of `neon_migration_test`)
- Should we create a new schema `neon` (or use it if it already exists)? I ask because this operation doesn't sound like a normal migration, so reusing `neon_migration` is a bit misleading. Overall, I think we should probably use the same `neon` schema for all internal metadata inside the `postgres` DB.
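If that suggestion were followed, the bootstrap could look roughly like this (a sketch of the proposed layout expressed through psycopg2, not the DDL in this PR; the DSN is a placeholder):

```python
import psycopg2

# Run against the `postgres` DB as a privileged role.
conn = psycopg2.connect("dbname=postgres")
conn.autocommit = True
with conn.cursor() as cur:
    # One dedicated schema for internal compute metadata, reused if it already exists.
    cur.execute("CREATE SCHEMA IF NOT EXISTS neon")
    cur.execute("REVOKE ALL ON SCHEMA neon FROM PUBLIC")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS neon.drop_subscriptions_done (timeline_id text)"
    )
```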
compute_tools/src/compute.rs (Outdated)
let query = format!("select 1 from neon_migration.drop_subscriptions_done where timeline_id = '{}'", timeline_id);

info!("Checking if drop_subscriptions_done was already performed for timeline_id: {}", timeline_id);
info!("Query: {}", query);
NIT: should it be a debug?
];

if spec.drop_subscriptions_before_start && !drop_subscriptions_done {
    info!("Adding FinalizeDropLogicalSubscriptions phase because drop_subscriptions_before_start is set");
NIT: can you please make the logging consistent, i.e. start everything with lowercase?
Hm, I've tried to keep it consistent with the surrounding code.
It looks like we don't have one style in this file.
}
// This is a hack to enable testing of the drop_subscriptions_before_start feature.
// Why don't we just always set the database vector properly here
// so that the code matched real cplane logic?
For real-life logic, it should be some non-superuser role and a non-postgres DB, so it's more like the case of `create_test_user=true` <- and I'd do this, yes; let's convert this to a TODO?
But I don't really understand why you need the `postgres` DB here? It's already present and its owner is `cloud_admin`. It's always like that, because they are created by `initdb`.
INSERT INTO neon_migration.drop_subscriptions_done
VALUES (current_setting('neon.timeline_id'));
Just inserting, we will bloat this table if the user creates a branch of a branch OR restores, because each time we will only append a new timeline here.
I think we can safely clean the table each time, but in a transaction block (which is probably ensured by the procedure block, but not sure), like:
BEGIN;
DELETE FROM neon_migration.drop_subscriptions_done;
INSERT INTO neon_migration.drop_subscriptions_done
VALUES (current_setting('neon.timeline_id'));
END;
That way, we only have the latest timeline in this table, which is what we actually need. And its size will likely be limited to a single page.
REVOKE ALL ON SCHEMA neon_migration FROM PUBLIC;

CREATE TABLE neon_migration.drop_subscriptions_done
(timeline_id text);
Not sure where exactly, but I'd mention in the comments why we store the timeline_id here, i.e. that during time-travel and branch resets we may need to do the drop again, so we cannot rely on just a boolean flag, as it might already be set.
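To spell out that reasoning, a tiny sketch of the check this implies (the names are illustrative, not the actual compute_tools code):

```python
from typing import Optional

def need_drop_subscriptions(stored_timeline_id: Optional[str], current_timeline_id: str) -> bool:
    # A boolean flag can survive time-travel or a branch reset in the "already done"
    # state even though the drop was performed for a different (ancestor) timeline.
    # Comparing timeline ids makes the check idempotent per timeline.
    return stored_timeline_id != current_timeline_id

assert need_drop_subscriptions(None, "tl_child")            # never recorded -> drop
assert need_drop_subscriptions("tl_parent", "tl_child")     # recorded for the parent -> drop again
assert not need_drop_subscriptions("tl_child", "tl_child")  # already done on this timeline -> skip
```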
Added the comment here, but there is also a big comment in apply_spec_sql.
> but there is also a big comment in apply_spec_sql

Yeah, that was good, thanks, it just didn't mention why we need to store the timeline id.
`disable_logical_replication_subscribers` ... with logical subscriptions. If it is set, drop all the subscriptions from the compute node before it starts. To avoid a race on compute start, use the new GUC `neon.disable_logical_replication_subscribers` to temporarily disable logical replication workers until we drop the subscriptions.
Co-authored-by: Alexey Kondratov <[email protected]>
... when the endpoint starts on a new branch. It is essential, because otherwise we may drop not only inherited but also newly created subscriptions. We cannot rely only on the spec.drop_subscriptions_before_start flag, because if for some reason compute restarts inside the VM, it will start again with the same spec and flag value. To handle this, we save the fact of the operation in the database, in the neon_migration_test.drop_subscriptions_done table. If the table does not exist, we assume that the operation was never performed, so we must do it. If the table exists, we check whether the operation was performed on the current timeline. Also adjust the test test_subscriber_branching.py to cover this case.
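A sketch of the startup sequence these commit messages describe, with the SQL spelled out in Python (the DSN is a placeholder and the exact statements the compute runs may differ; `ALTER SUBSCRIPTION ... SET (slot_name = NONE)` is the standard way to drop a subscription without dropping the slot on the publisher):

```python
import psycopg2

# Assumed DSN; the cleanup runs per database. The sketch assumes the compute was
# started with the new GUC (neon.disable_logical_replication_subscribers) set,
# so no apply worker races with the cleanup.
conn = psycopg2.connect("dbname=neondb")
conn.autocommit = True
with conn.cursor() as cur:
    # pg_subscription is a shared catalog, so restrict it to the current database.
    cur.execute(
        "SELECT subname FROM pg_subscription"
        " WHERE subdbid = (SELECT oid FROM pg_database WHERE datname = current_database())"
    )
    for (subname,) in cur.fetchall():
        # Detach the slot before dropping: the slot on the publisher still belongs to
        # the parent branch's subscription of the same name and must not be dropped here.
        cur.execute(f'ALTER SUBSCRIPTION "{subname}" DISABLE')
        cur.execute(f'ALTER SUBSCRIPTION "{subname}" SET (slot_name = NONE)')
        cur.execute(f'DROP SUBSCRIPTION "{subname}"')
```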
Force-pushed from 8e200c1 to 0d63443 (compare)
Problem

#8790

Summary of changes

This PR adds a new GUC `disable_logical_replication_subscribers` that will prevent subscribers from starting on a new branch.

TODO: