-
You appear to be making some big claims here, through phrases like:
If you have data to support any of this, please show it. It's not worth discussing solutions until it's established that there is a problem, and I don't see it here.
-
Double checksumming could reduce the risk of memory errors affecting data integrity in ZFS. The idea is to apply two independent checksums at different stages: one for the data while it still resides in memory and another for the data as it is written to disk.
Here’s how double checksumming could theoretically help:
Memory-based checksumming: a checksum computed while the data still sits in RAM, at the time the write is submitted, gives the system a reference value to verify the buffer against later in the write path. This serves as an additional line of defense against corruption caused by faulty memory.
Write-time checksumming: even if the data is corrupted in RAM, the checksum destined for disk could be compared against the memory checksum just before the final write. If the two checksums do not match, the system could flag the issue instead of silently committing bad data.
Currently, ZFS checksums the data as it is written to the disk, ensuring that any read of the data from disk can be compared to the checksum to detect corruption. However, it does not check for errors that may have been introduced in memory during the write process. A double-checking mechanism could flag errors earlier in the workflow, giving ZFS a more proactive ability to protect data.
By introducing an extra checksum at the memory layer, the system could also reduce the risk of silent corruption going unnoticed until a scrub operation is run or the data is actually needed.
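To make the two-stage idea concrete, here is a minimal stand-alone C sketch (not ZFS code); the `checksum` function is just a simplified Fletcher-style stand-in for ZFS's fletcher4, and the bit flip is simulated:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified Fletcher-style checksum, a stand-in for ZFS's fletcher4. */
static uint64_t
checksum(const uint8_t *buf, size_t len)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += a;
    }
    return ((b << 32) | (a & 0xffffffffULL));
}

int
main(void)
{
    uint8_t block[4096];
    memset(block, 0x5A, sizeof(block));

    /* Stage 1: checksum the buffer while it still sits in RAM,
     * at the moment the write is submitted. */
    uint64_t cksum_in_memory = checksum(block, sizeof(block));

    /* Simulate a single bit flip caused by faulty (non-ECC) RAM
     * between submission and the actual disk write. */
    block[1234] ^= 0x04;

    /* Stage 2: recompute just before the data is handed to the disk
     * and compare against the in-memory checksum. */
    uint64_t cksum_at_write = checksum(block, sizeof(block));

    if (cksum_at_write != cksum_in_memory) {
        fprintf(stderr, "mismatch: data changed in RAM before the write\n");
        return 1;
    }
    printf("checksums match, safe to write\n");
    return 0;
}
```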
This would be great, but we have to consider that memory errors could still corrupt data before it is even checksummed, and the extra checksum might only identify the issue after the data has already been written or read incorrectly.
What I am suggesting:
Implementing double checksumming would require modifying the ZFS code to compute checksums at both the memory and the disk level, which could be done by the community or through developer contributions. My suggestion is to offer an option that enables this only on hosts that do not have ECC memory, so the extra overhead and complexity would not affect users who do not need it.

Currently, people running TrueNAS without ECC memory are highly exposed to RAM errors propagating into data, metadata, and snapshots, which can leave the pool unmountable and the data nearly unrecoverable. The scrub process is the main vector for propagating errors that originate in RAM and rendering data unusable. An opt-in verification step would provide at least some level of reliability. At the moment TrueNAS is much less safe than EXT4, because OpenZFS's extra integrity machinery implicitly relies on ECC RAM (currently there is no redundancy there for non-ECC RAM).
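Here is a rough sketch of how such an opt-in gate might be wired in; `verify_nonecc`, `write_block`, and the trivial checksum helper are made-up names for illustration, not existing ZFS tunables or functions:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical opt-in flag; in a real patch this could be a module
 * parameter or pool property, but no such tunable exists today. */
static int verify_nonecc = 1;

/* Trivial additive checksum, stand-in for fletcher4/sha256. */
static uint64_t
cksum(const uint8_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) { a += p[i]; b += a; }
    return ((b << 32) | (a & 0xffffffffULL));
}

/* Re-verify the in-memory buffer only when the flag is set,
 * so hosts with ECC RAM pay no extra cost. */
static int
write_block(const uint8_t *buf, size_t n, uint64_t cksum_at_submit)
{
    if (verify_nonecc && cksum(buf, n) != cksum_at_submit) {
        fprintf(stderr, "in-memory corruption detected, write aborted\n");
        return (-1);
    }
    /* ...hand the buffer to the disk here... */
    return (0);
}

int
main(void)
{
    uint8_t data[4096];
    memset(data, 0xAB, sizeof(data));
    uint64_t c = cksum(data, sizeof(data)); /* checksum at submission time */
    data[100] ^= 0x01;                      /* simulate a RAM bit flip */
    return (write_block(data, sizeof(data), c) == 0 ? 0 : 1);
}
```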