Electric fails to detect when its persistent storage goes away and prevents Postgres from cleaning up WAL files #1719

Open
alco opened this issue Sep 17, 2024 · 4 comments

Comments

@alco
Member

alco commented Sep 17, 2024

Imagine a simple scenario where Electric is running inside a Kubernetes pod with no persistent storage. Some shapes get created, so Electric creates a publication in Postgres and starts processing transactions.

When the pod is restarted, a new file system is created for it with no traces of the previous shape storage. Electric will no longer be able to process incoming transactions from Postgres, causing the latter to build up its WAL backlog indefinitely.
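To put a number on that backlog, one rough approach is to compare the server's current WAL position with the slot's `restart_lsn`. The sketch below assumes both values have already been fetched as text (for example from `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`); the function names are illustrative and not part of Electric.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a Postgres LSN such as '0/16B3748' into an absolute byte position.

    An LSN is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_wal_lsn: str, slot_restart_lsn: str) -> int:
    """Approximate bytes of WAL that the slot prevents Postgres from recycling."""
    return lsn_to_int(current_wal_lsn) - lsn_to_int(slot_restart_lsn)


if __name__ == "__main__":
    # Example values only; real ones would come from the database.
    print(retained_wal_bytes("1/0", "0/FF000000"))
```

A value that keeps growing while no shapes are being served is the symptom described here. Postgres recycles WAL in whole segments, so the actual disk usage will differ somewhat from this figure.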

Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.
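As a minimal sketch of the proposed behaviour (not Electric's actual implementation, which is written in Elixir), the handler below acknowledges a transaction even when nothing can consume it, so the slot keeps advancing. `ReplicationClient`, `active_shape_ids`, and `deliver` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReplicationClient:
    """Hypothetical stand-in for the process that reads from the replication slot."""

    active_shape_ids: set = field(default_factory=set)
    confirmed_lsn: int = 0  # last position acknowledged back to Postgres
    dropped: int = 0        # transactions discarded because nobody could consume them

    def handle_transaction(self, txn_lsn: int, deliver: Callable[[int], None]) -> None:
        if not self.active_shape_ids:
            # No shape collector is running: drop the transaction but still
            # acknowledge it, so Postgres is free to recycle the WAL behind it.
            self.dropped += 1
            self.confirmed_lsn = max(self.confirmed_lsn, txn_lsn)
            return
        # Normal path: acknowledge only after delivery has succeeded.
        deliver(txn_lsn)
        self.confirmed_lsn = max(self.confirmed_lsn, txn_lsn)
```

The design choice here is that acknowledgement is tied to "nothing left to do with this transaction" rather than to "a shape consumed it", which is what keeps the slot from pinning WAL when storage has vanished.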

@thruflo
Contributor

thruflo commented Sep 17, 2024

@msfstef
Contributor

msfstef commented Oct 8, 2024

I think this is also related to #1774: if Electric boots up and has no shapes, it should update the replication slot accordingly. I feel that there should be a mechanism/service that keeps the publication properly configured, and perhaps that should also inform how to handle "deprecated" transactions.
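One way such a mechanism could work, sketched under the assumption that the set of tables needed by active shapes is known, is to diff it against the tables currently in the publication and emit the corresponding `ALTER PUBLICATION` statements. The function name and inputs are hypothetical, and table names are assumed to be already safely quoted.

```python
def reconcile_publication(publication: str, current: set, desired: set) -> list:
    """Return the SQL statements needed to make the publication match `desired`.

    `current` is the set of tables in the publication today; `desired` is the
    set of tables that active shapes actually need.
    """
    statements = []
    to_add = sorted(desired - current)
    to_drop = sorted(current - desired)
    if to_add:
        statements.append(
            f"ALTER PUBLICATION {publication} ADD TABLE {', '.join(to_add)}"
        )
    if to_drop:
        statements.append(
            f"ALTER PUBLICATION {publication} DROP TABLE {', '.join(to_drop)}"
        )
    return statements
```

With no shapes at boot, `desired` is empty and every table is dropped from the publication, so Postgres stops sending changes that nobody will consume.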

@robacourt
Contributor

For multi-tenancy, losing storage would mean we don't know where the databases are, so we couldn't clean up the publications. But multi-tenancy is an advanced use case so perhaps we can get away with not dealing with it.

@balegas
Contributor

balegas commented Nov 5, 2024

> Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.

If I read correctly, when a fresh Electric instance finds a pre-existing replication slot, it sees that the slot is inconsistent with its local metadata (which is empty) and doesn't use it. This is a conservative approach in case the owner of the replication slot connects later.

> if Electric boots up and has no shapes, it should update the replication slot accordingly.

@msfstef, meaning just drop the replication slot and recreate it, right? Should we make sure this is intentional, e.g. with a --force-recreate-replication-slot flag, or be more optimistic, since it's important to make sure that we clean up the WAL and shapes are cheap anyway? (cc @KyleAMathews @robacourt)
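The two options can be compared in a small decision table. This is only a sketch of the trade-off being discussed, not existing Electric behaviour; `SlotAction`, `decide_slot_action`, and the `optimistic` switch are hypothetical names, and `force_recreate` stands in for an explicit force-recreate flag.

```python
from enum import Enum


class SlotAction(Enum):
    CREATE = "create"      # no slot yet: create one
    RESUME = "resume"      # slot and local state agree: keep streaming
    RECREATE = "recreate"  # drop the stale slot and start fresh
    REFUSE = "refuse"      # conservative: leave the slot alone


def decide_slot_action(
    slot_exists: bool,
    has_local_state: bool,
    force_recreate: bool = False,
    optimistic: bool = False,
) -> SlotAction:
    """Pick a startup action for the replication slot."""
    if not slot_exists:
        # Simplification: any leftover local state is ignored in this case.
        return SlotAction.CREATE
    if has_local_state:
        return SlotAction.RESUME
    # The slot exists but local metadata is empty, e.g. storage was lost.
    if force_recreate or optimistic:
        return SlotAction.RECREATE
    return SlotAction.REFUSE
```

The conservative default protects a slot whose real owner may reconnect later, at the cost of WAL retention. The optimistic variant prioritises WAL cleanup on the grounds that shapes are cheap to rebuild.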


5 participants