CLN crash and gossip store corruption #7971
I think it happened again:
I cannot access the attachment. Also, what filesystem are you putting the gossip_store file on? I've just done an audit and I cannot see how this would happen, but I'm double-checking now.
Hm, seems like GitHub didn't like the attachment. Here it is on S3: https://rizful-public.s3.us-east-1.amazonaws.com/temp/gossip_store-corrupt.zip

CLN is running in Docker; please see the Dockerfile below.

The entire CLN directory is on a ZFS mirror, running on Ubuntu 22 -- the mirror is two drives that ZFS mirrors to look like one. But as far as I know this should only reduce the chances of data corruption because ZFS mirrors are (supposedly) so rock-solid.
We have a report of this happening under ZFS. We cannot do much if this really is a problem where we can't read back what we write, but this avoids the immediate crash.

Fixes: ElementsProject#7971
Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
I suspect ZFS has a race where we can see zeroes if reading during writing. This should now be handled; it's a bit awkward though.
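The idea of tolerating a read that races with a write can be sketched roughly as follows. This is a minimal Python illustration, not CLN's actual C code: the function name `read_header_with_retry`, the 12-byte header size, and the retry policy are all assumptions made up for the example. It treats a short or all-zero header as "possibly still being written" and re-reads a few times before giving up, so the caller can stop cleanly instead of crashing.

```python
import os
import time

HEADER_SIZE = 12  # assumed record-header size, for illustration only


def read_header_with_retry(fd, offset, size=HEADER_SIZE, retries=3, delay=0.01):
    """Read `size` bytes at `offset`, retrying if the data looks unwritten.

    Returns the header bytes, or None if every attempt saw a short read
    or nothing but zero bytes.
    """
    for attempt in range(retries):
        # Positional read: does not move the descriptor's file offset.
        data = os.pread(fd, size, offset)
        if len(data) == size and any(data):
            return data
        if attempt + 1 < retries:
            time.sleep(delay)  # give a concurrent writer time to finish
    return None
```

A caller would treat `None` as "stop reading here and try again later" rather than as proof of corruption.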
OK, I downloaded the file. It has overwritten a record at byte offset 1, corrupting the entire thing. On further analysis, this is possible with our current code if HAVE_PWRITEV is not set! However, the same PR changes this too (to use independent file descriptors) so this is also fixed.
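To illustrate why a shared descriptor can produce an overwrite at offset 1, here is a small Python sketch. It is not the CLN code path: the file layout (one leading byte followed by 4-byte records) and both function names are hypothetical. The first function emulates a fallback that uses plain `write()` on a descriptor whose offset a reader has moved; the second uses independent descriptors, with the writer opened in `O_APPEND` mode.

```python
import os
import tempfile


def write_shared(path):
    """Reader and writer share one descriptor, so they share one file offset."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"V" + b"AAAA" + b"BBBB")  # leading byte + two records
        os.lseek(fd, 1, os.SEEK_SET)  # "reader" rewinds to the first record
        os.write(fd, b"CCCC")         # "writer" believes it is appending
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read()


def write_independent(path):
    """Writer appends via its own O_APPEND descriptor; reader uses another."""
    wfd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_APPEND, 0o600)
    rfd = os.open(path, os.O_RDONLY)
    try:
        os.write(wfd, b"V" + b"AAAA" + b"BBBB")
        os.lseek(rfd, 1, os.SEEK_SET)  # reader's seek cannot affect the writer
        os.write(wfd, b"CCCC")         # O_APPEND always writes at end of file
    finally:
        os.close(rfd)
        os.close(wfd)
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_shared(os.path.join(d, "shared")))      # b'VCCCCBBBB'
        print(write_independent(os.path.join(d, "indep")))  # b'VAAAABBBBCCCC'
```

In the shared case the first record is clobbered, which matches the kind of damage described above; a positional write such as `pwritev` or a separate descriptor avoids depending on the shared offset.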
Thanks, this is interesting, because I think it's the first time I have seen bad behavior on any system (and any application) which might be attributed to ZFS. I wonder if CLN is somehow not using the "guarantees" of the OS and is instead reaching around to get to the raw data on the drive? Anyway, this question is out of my league. I will update when there is a new Docker image and report if it happens again, thanks.
After running this image of CLN without (apparent) problems for about two weeks....
CLN suddenly crashed with ...
A more complete log showing the crash and the creation of gossip_store.corrupt is attached: cln-gossip-store-corrupt.txt