Electric fails to detect when its persistent storage goes away and prevents Postgres from cleaning up WAL files #1719

alco · 2024-09-17T07:59:06Z

Imagine a simple scenario where Electric is running inside a Kubernetes pod with no persistent storage. Some shapes get created, so Electric creates a publication in Postgres and starts processing transactions.

When the pod is restarted, a new file system is created for it with no traces of the previous shape storage. Electric will no longer be able to process incoming transactions from Postgres, causing the latter to build up its WAL backlog indefinitely.

Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.
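One way to make this failure mode visible from the outside is to measure how much WAL a slot is forcing Postgres to retain. The following is a minimal sketch, not Electric's implementation: the helper names and the 1 GiB threshold are assumptions made for illustration. The SQL uses the standard `pg_replication_slots` view and `pg_wal_lsn_diff` function, and the pure-Python helpers do the same arithmetic on LSN strings.

```python
# Server-side query for reference: approximate WAL retained per slot.
RETAINED_WAL_SQL = """
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
"""


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Approximate bytes of WAL kept because the slot's restart_lsn lags behind."""
    return parse_lsn(current_lsn) - parse_lsn(restart_lsn)


def is_stalled(current_lsn: str, restart_lsn: str,
               threshold_bytes: int = 1 << 30) -> bool:
    """Hypothetical alert rule: flag a slot retaining more than threshold_bytes."""
    return retained_wal_bytes(current_lsn, restart_lsn) > threshold_bytes
```

A slot that is never advanced shows `retained_bytes` growing without bound, which is the symptom described above.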

thruflo · 2024-09-17T08:24:17Z

N.b.: https://discord.com/channels/933657521581858818/1285476835412541516

msfstef · 2024-10-08T14:02:32Z

I think this is also related to #1774: if Electric boots up and has no shapes, it should update the replication slot accordingly. I feel that there should be a mechanism/service that keeps the publication properly configured, and perhaps that should also inform how to handle "deprecated" transactions.
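As a rough illustration of what such a mechanism could do, the sketch below computes the `ALTER PUBLICATION` statements needed to bring a publication in line with the tables that active shapes currently need. The function name and the example publication name are hypothetical, and table names are assumed to be already validated and quoted; this is not how Electric is known to do it.

```python
def publication_diff(publication: str, published: set[str], wanted: set[str]) -> list[str]:
    """Return the ALTER PUBLICATION statements that make `published` equal `wanted`.

    `published` is the set of tables currently in the publication;
    `wanted` is the set of tables referenced by active shapes.
    Identifiers are assumed to be safe to interpolate (hypothetical caller contract).
    """
    to_add = sorted(wanted - published)
    to_drop = sorted(published - wanted)
    statements = []
    if to_add:
        statements.append(f"ALTER PUBLICATION {publication} ADD TABLE {', '.join(to_add)};")
    if to_drop:
        statements.append(f"ALTER PUBLICATION {publication} DROP TABLE {', '.join(to_drop)};")
    return statements
```

With no shapes after a restart, `wanted` is empty and every previously published table ends up in a single `DROP TABLE` statement.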

robacourt · 2024-11-05T17:22:15Z

For multi-tenancy, losing storage would mean we don't know where the databases are, so we couldn't clean up the publications. But multi-tenancy is an advanced use case so perhaps we can get away with not dealing with it.

balegas · 2024-11-05T17:30:57Z

> Electric should be able to detect this failure mode and drop transactions when there is no active shape collector process instead of keeping those transactions around by refusing to advance the replication slot.

If I read correctly, when a fresh Electric finds a pre-existing replication slot, it will find that the slot is inconsistent with its local metadata (which is empty) and won't use it. This is a conservative approach in case the owner of the replication slot connects later.

> if Electric boots up and has no shapes, it should update the replication slot accordingly.

@msfstef, meaning just drop the replication slot and recreate it, right? Should we make sure this is intentional (e.g. a --force-recreate-replication-slot flag), or be more optimistic, since it's important to make sure that we clean up the WAL and shapes are cheap anyway? (cc @KyleAMathews @robacourt)
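The two options can be laid out as a small decision table. This is only a sketch of the trade-off under discussion: `SlotAction`, `decide_slot_action`, and the `allow_recreate` parameter are hypothetical names, with `allow_recreate` standing in for either an explicit opt-in flag or an optimistic default.

```python
from enum import Enum


class SlotAction(Enum):
    CREATE = "create"            # no slot yet: create one
    RESUME = "resume"            # slot and local metadata agree: keep streaming
    RECREATE = "recreate"        # drop the stale slot so Postgres can recycle WAL
    LEAVE_ALONE = "leave_alone"  # conservative: the slot's owner may reconnect later


def decide_slot_action(slot_exists: bool, has_local_metadata: bool,
                       allow_recreate: bool = False) -> SlotAction:
    """Pick a startup action from the slot state and local storage state."""
    if not slot_exists:
        return SlotAction.CREATE
    if has_local_metadata:
        return SlotAction.RESUME
    # Slot exists but local storage is empty, e.g. after a pod restart
    # without persistent storage.
    return SlotAction.RECREATE if allow_recreate else SlotAction.LEAVE_ALONE
```

Defaulting `allow_recreate` to `False` matches the conservative behaviour described above; flipping the default corresponds to the optimistic option.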

alco added the bug label Sep 17, 2024

alco mentioned this issue Sep 17, 2024

Document on-disk persistent storage requirements #1720

Closed

alco added the reliability label Sep 17, 2024
