-
You appear to be making some big claims here, through phrases like:
If you have data to support any of this, please show it. It's not worth discussing solutions until it's established that there is a problem, and I don't see it here.
-
Double checksumming could reduce the risk of memory errors affecting data integrity in ZFS. The idea is to apply two independent checksums at different stages: one for the data while it still resides in memory and another for the data as it is written to disk.
Here’s how double checksumming could theoretically help:
Memory-based checksumming: a checksum computed while the data still sits in RAM, at the time the write is submitted, gives the system a reference value to verify the buffer against later in the write path. This serves as an additional line of defense against corruption caused by faulty memory.
Write-time checksumming: even if the data is corrupted in RAM, the checksum destined for disk could be compared against the memory checksum just before the final write. If the two checksums do not match, the system could flag the issue instead of silently committing bad data.
Currently, ZFS checksums the data as it is written to the disk, ensuring that any read of the data from disk can be compared to the checksum to detect corruption. However, it does not check for errors that may have been introduced in memory during the write process. A double-checking mechanism could flag errors earlier in the workflow, giving ZFS a more proactive ability to protect data.
By introducing an extra checksum at the memory layer, the system could also reduce the risk of silent corruption going unnoticed until a scrub operation is run or the data is actually needed.
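To make the two-stage idea concrete, here is a minimal stand-alone C sketch (not ZFS code); the `checksum` function is just a simplified Fletcher-style stand-in for ZFS's fletcher4, and the bit flip is simulated:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified Fletcher-style checksum, a stand-in for ZFS's fletcher4. */
static uint64_t
checksum(const uint8_t *buf, size_t len)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += a;
    }
    return ((b << 32) | (a & 0xffffffffULL));
}

int
main(void)
{
    uint8_t block[4096];
    memset(block, 0x5A, sizeof(block));

    /* Stage 1: checksum the buffer while it still sits in RAM,
     * at the moment the write is submitted. */
    uint64_t cksum_in_memory = checksum(block, sizeof(block));

    /* Simulate a single bit flip caused by faulty (non-ECC) RAM
     * between submission and the actual disk write. */
    block[1234] ^= 0x04;

    /* Stage 2: recompute just before the data is handed to the disk
     * and compare against the in-memory checksum. */
    uint64_t cksum_at_write = checksum(block, sizeof(block));

    if (cksum_at_write != cksum_in_memory) {
        fprintf(stderr, "mismatch: data changed in RAM before the write\n");
        return 1;
    }
    printf("checksums match, safe to write\n");
    return 0;
}
```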
This would be great, but we have to consider that memory errors could still corrupt data before it is even checksummed, and the extra checksum might only identify the issue after the data has already been written or read incorrectly.
What I am suggesting:
Implementing double checksumming would require modifying the ZFS code to compute checksums at both the memory and the disk level, which could be done by the community or through developer contributions. My suggestion is to offer an option that enables this only on hosts that do not have ECC memory, so the extra overhead and complexity would not affect users who do not need it.

Currently, people running TrueNAS without ECC memory are highly exposed to RAM errors propagating into data, metadata, and snapshots, which can leave the pool unmountable and the data nearly unrecoverable. The scrub process is the main vector for propagating errors that originate in RAM and rendering data unusable. An opt-in verification step would provide at least some level of reliability. At the moment TrueNAS is much less safe than EXT4, because OpenZFS's extra integrity machinery implicitly relies on ECC RAM (currently there is no redundancy there for non-ECC RAM).
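Here is a rough sketch of how such an opt-in gate might be wired in; `verify_nonecc`, `write_block`, and the trivial checksum helper are made-up names for illustration, not existing ZFS tunables or functions:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical opt-in flag; in a real patch this could be a module
 * parameter or pool property, but no such tunable exists today. */
static int verify_nonecc = 1;

/* Trivial additive checksum, stand-in for fletcher4/sha256. */
static uint64_t
cksum(const uint8_t *p, size_t n)
{
    uint64_t a = 0, b = 0;
    for (size_t i = 0; i < n; i++) { a += p[i]; b += a; }
    return ((b << 32) | (a & 0xffffffffULL));
}

/* Re-verify the in-memory buffer only when the flag is set,
 * so hosts with ECC RAM pay no extra cost. */
static int
write_block(const uint8_t *buf, size_t n, uint64_t cksum_at_submit)
{
    if (verify_nonecc && cksum(buf, n) != cksum_at_submit) {
        fprintf(stderr, "in-memory corruption detected, write aborted\n");
        return (-1);
    }
    /* ...hand the buffer to the disk here... */
    return (0);
}

int
main(void)
{
    uint8_t data[4096];
    memset(data, 0xAB, sizeof(data));
    uint64_t c = cksum(data, sizeof(data)); /* checksum at submission time */
    data[100] ^= 0x01;                      /* simulate a RAM bit flip */
    return (write_block(data, sizeof(data), c) == 0 ? 0 : 1);
}
```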