An example of a Reed-Solomon based error-correcting block device backed by RAM.
corrupted:
' ..:::::7::::b:.. ' ' 8 .:::....:::.96a:9:859.e:.dfbed5e 94ab8 4c:.4
. 48 .:::::::::::::::::::.. ' :::::::::::::44ef696b 6d'6:e:79e5.748f.8fc'9
8 ' .::::::::::':::':::::::::. 8::::::::::' e75a6ac555.f8cf8.b:9'ed57ad4a ec
.:::::::::::::::::::::::::::. ' . ' . :::::::::: 'c4ddf9fd:9.7e ef 7a5c 7fa6'c4'
:::::'::::::7:::7.::::::::::::: ... :: :':::'::' bb'c:da5a7 d6b.87d5b57b 9f7a: a:
4::::::::::::::'::::::::::::::'' .. '::: ' ''' '7 aece5:de .68467c79fdd'49:8597
::::::::::::::::::::::'::::''.:::::: ' . . fb6b.6:'44f4'dcdf5e'.fac..:6fa8e
:7:::::::::::::::::::::'' ..:::::'' ' ' dbaa4ba6e5a'c9 5a adb5e:7f8b554e
8 8:::::::::::::::::::'' ..:::::'' .. . .' 'be4d.dd4b6c64:a5e'ba4:b578d :
8 ..: ::::::::::::::''' ..:::::'' ..::: .a55f.cf:4':9.b8:44ea9c5bfd9c7e.
8.::::' :::::::::'' ..::::::'' ..::::::: . f8.'c7d.c9b:49a'48a:7 5c:9cc.6:b
:::'4 '7::'' ...::::::''.5.:::::::::: . edfe94e8:.b4ff':c655 5d6f7f5d:7'
::' ...::::::6' ..:::::::::::::' 8 8 c88f.'96f8fbe5787bbad4f.bf:d9b
9...:::::::'' ..::::::::::'::::.:' 4 ebe5eecda4fdee9b5:58:ee f:cafe::
'::::::::'''4. ::::::::::::::::7::'' . 8 ece:'b 6:7:a7.e5a'6a dfe'b967fe:
'':::::::::::''' .. '99ade89:df5 7b7b e475ad:c98efaac
corrected:
..:::::::::::... .:::....:::.96a:9:859.e:.dfbed5e 94ab c:.4
.:::::::::::::::::::.. :::::::::::::44ef696b 6d'6:e:79e5.748f.afc'9
.::::::::::::::::::::::::. ::::::::::' e75a6ac555.f8cf8.b:9'ad56ad5a ec
.:::::::::::::::::::::::::::. . :::::::::: 'c4dcf9fd:9.7e ef 7a5c67fa6'c4'
.:::::::::::::::::::::::::::::: ... :: :'::::::' bb'c:da5a74d6b.87d5b57b 9f7a: a:
:::::::::::::::::::::::::::::'' .. '::: ' ''' '7 aece5:de .68467c:9fdd'c9:8597
:::::::::::::::::::::::::::''..::::: ' fb6b.6:'44f4'dcdf5e'.fac..:6fa8e
:::::::::::::::::::::::'' ..:::::'' dbaa4ba6e5''c9 5b.adb5e:6f8b554e
:::::::::::::::::::'' ..:::::'' . ' 'be4d.dd4b6c64:a5ea.a4:b578d :
..: ::::::::::::::''' ..:::::'' ..::: .a55f.cf:4':9.b8:446a8c5bfd9c7e.
..:::' :::::::::'' ..::::::'' ..::::::: f8 'c6d.c9b:49a'48a:7 5c79cc.6:b
:::' ':::'' ...::::::''...:::::::::: edfe94e8:.b4ff':c65585d'f7b5d:7'
::' ...::::::'' ..:::::::::::::' c88f:'96f8fbc5787bbad4f.bf:d9b8
....:::::::'' ..:::::::::::::::::' ebe5eecda4fdee8b5:58:ee e:cafe::
'::::::::''' :::::::::::::::::::'' ece:'b 6:7:af9e5a'6' dfe'b967fe:
'':::::::::::''' 99adf89:df5 7b7b e475ad:c98efaac
Reed-Solomon codes are sort of the big brother to CRCs. Operating on bytes instead of bits, they are both flexible and powerful, capable of detecting and correcting a configurable number of byte errors efficiently.
Assuming ecc_size
bytes of ECC, ramrsbd can correct floor(ecc_size/2)
byte errors in up to 255 byte codewords (yes, 255 bytes, not 256 :/).
The main tradeoff is they are quite a bit more complex mathematically, which adds some code, RAM, and brain cost.
A quick comparison of current ram-ecc-bds:
|            |   code | tables | stack |  buffers | runtime |
|:-----------|-------:|-------:|------:|---------:|--------:|
| ramcrc32bd |  940 B |   64 B |  88 B |      0 B |         |
| ramrsbd    | 1506 B |  512 B | 128 B | n + 4e B |         |
See also:
Right now, littlefs's block device API is limited in terms of composability. While it would be great to fix this on a major API change, in the meantime, a RAM-backed block device provides a simple example of error-correction that users may be able to reimplement in their own block devices.
Testing is a bit jank right now, relying on littlefs's test runner:
$ git clone https://github.com/littlefs-project/littlefs -b v2.9.3 --depth 1
$ make test -j
Before we get into how the algorithm works, a couple words of warning:
-
I'm not a mathematician! Some of the definitions here are a bit handwavey, and I'm skipping over the history of BCH codes, PGZ, Euclidean methods, etc. I'd encourage you to also explore Wikipedia and other relevant articles to learn more.
My goal is to explain, to the best of my (limited) knowledge, how to implement Reed-Solomon codes, and how/why they work.
-
The following math relies heavily on finite-fields (sometimes called Galois-fields) and the related theory.
If you're not familiar with finite-fields, they are an abstraction we can make over finite numbers (bytes for GF(256), bits for GF(2)) that lets us do most of our usual math without worrying about pesky things like integer overflow.
But there's not enough space here to fully explain how they work, so I'd suggest reading some of the above articles first. There's also a small GF(256) sketch right after this list if it helps to see the arithmetic in code.
-
This is some pretty intense math! Reed-Solomon has evolved over several decades and many PhDs. Just look how many names are involved in this thing. So don't get discouraged if it feels impenetrable.
Feel free to take a break to remind yourself that math is not real and can not hurt you.
I'd also encourage you to use GitHub's table-of-contents to jump around and keep track of where you are.
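To make the finite-field math a bit more concrete, here's a rough sketch of GF(256) multiplication and division using 256-entry log/pow tables, the same general approach ramrsbd uses internally. The reduction polynomial 0x11d and generator g=2 here are assumptions for the sketch, not necessarily the exact constants ramrsbd uses:

```python
# build GF(256) log/pow tables, assuming the common reduction polynomial
# x^8 + x^4 + x^3 + x^2 + 1 (0x11d) and generator g = 2
GF_POW = [0]*256
GF_LOG = [0]*256
x = 1
for i in range(255):
    GF_POW[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d

def gf_mul(a, b):
    # multiplication is addition of logarithms, modulo the group order 255
    if a == 0 or b == 0:
        return 0
    return GF_POW[(GF_LOG[a] + GF_LOG[b]) % 255]

def gf_div(a, b):
    # division is subtraction of logarithms, note division by zero is undefined
    if a == 0:
        return 0
    return GF_POW[(GF_LOG[a] - GF_LOG[b]) % 255]

# addition and subtraction in GF(256) are both just xor
assert gf_mul(gf_div(0x53, 0xca), 0xca) == 0x53
```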
Like CRCs, Reed-Solomon codes are implemented by concatenating the message with the remainder after division by a predetermined "generator polynomial", giving us a systematic code.
However, two important differences:
-
Instead of using a binary polynomial in GF(2), we use a polynomial in a higher-order finite-field, usually GF(256) because operating on bytes is convenient.
-
We intentionally construct the polynomial to tell us information about any errors that may occur.
Specifically, we constrain our polynomial (and implicitly our codewords) to evaluate to zero at
$n$ fixed points. As long as we have $e \le \frac{n}{2}$ errors, we can solve for errors using these fixed points. Note this isn't really possible with CRCs in GF(2), because the only non-zero binary number is, well, 1. GF(256) has 255 non-zero elements, which we'll see becomes quite important.
Ok, first step, constructing a generator polynomial.
If we want to correct $e$ byte-errors, we will need $n = 2e$ fixed points. We can construct a generator polynomial $P(x)$ with $n$ fixed points at $g^i$ where $i < n$ like so:

$$P(x) = \prod_{i=0}^{n-1} \left(x - g^i\right)$$

We could choose any arbitrary set of fixed points, but usually we choose $g^i$ where $g$ is a generator in GF(256), since it gives us a convenient mapping between integers and non-zero elements of the field.

Note that for any fixed point $g^i$, $x - g^i$ evaluates to zero:

$$g^i - g^i = 0$$

And since multiplying anything by zero is zero, this will make our entire
product zero. So for any fixed point $g^i$, $P(g^i)$ should also evaluate to zero:

$$P(g^i) = \prod_{k=0}^{n-1} \left(g^i - g^k\right) = 0$$
This gets real nifty when you look at the definition of our Reed-Solomon
code for codeword $C(x)$, given a message $M(x)$ and generator polynomial $P(x)$:

$$C(x) = M(x) x^n - \left(M(x) x^n \bmod P(x)\right)$$

As is true with normal math, subtracting the remainder after division
gives us a polynomial that is an exact multiple of $P(x)$. And since the fixed points $g^i$ are roots of $P(x)$, they must also be roots of our codeword:

$$C(g^i) = 0$$
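As a rough sketch (not the actual ramrsbd code), and reusing the gf_mul helper and GF_POW table from the earlier GF(256) sketch, constructing the generator polynomial might look something like this:

```python
def rs_generator_poly(n):
    # P(x) = prod_i (x - g^i) for i < n, coefficients highest-degree first;
    # note subtraction is just xor in GF(256)
    p = [1]
    for i in range(n):
        gi = GF_POW[i % 255]  # g^i
        # multiply p by (x - g^i): (p * x) xor (p * g^i)
        p = [a ^ gf_mul(b, gi) for a, b in zip(p + [0], [0] + p)]
    return p

# the leading coefficient is always 1, which is why it can be made implicit
assert rs_generator_poly(8)[0] == 1
```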
Ok, but what if there are errors?
We can think of introducing errors as adding an error polynomial $E(x)$ to our original codeword, where $E(x)$ contains up to $e$ non-zero terms:

$$C'(x) = C(x) + E(x)$$

Check out what happens if we plug in our fixed point $g^i$:

$$C'(g^i) = C(g^i) + E(g^i) = E(g^i)$$

The original codeword drops out! Leaving us with an equation defined only by the error polynomial.

We call these evaluations our "syndromes" $S_i$:

$$S_i = C'(g^i) = E(g^i)$$
We usually refer to the unknowns in this equation as the
"error-locations" $X_j = g^j$, where $j$ is the location of an error in the codeword, and the "error-magnitudes" $Y_j = E_j$:

$$S_i = \sum_{j \in e} Y_j X_j^i$$

Note that finding an error-location $X_j$ also gives us the location $j$, since $X_j = g^j$ and we can just take the logarithm base $g$ to get $j$ back.

If we can figure out both the error-locations and error-magnitudes, we have enough information to reconstruct our original codeword:

$$C(x) = C'(x) - E(x)$$
Ok, let's say we received a codeword $C'(x)$ containing some number of errors. The first step is calculating our syndromes $S_i$ by evaluating the received codeword at our fixed points $g^i$:

$$S_i = C'(g^i)$$

If all of the syndromes are zero, there are (probably) no errors and we can stop here.
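Here's a little sketch of the syndrome calculation, again leaning on the earlier GF(256) helpers and assuming the codeword's bytes are its polynomial coefficients stored highest degree first:

```python
def rs_syndromes(codeword, n):
    # evaluate the received codeword C'(x) at each fixed point g^i using
    # Horner's method; all-zero syndromes means (probably) no errors
    syndromes = []
    for i in range(n):
        gi = GF_POW[i % 255]
        s = 0
        for b in codeword:
            s = gf_mul(s, gi) ^ b
        syndromes.append(s)
    return syndromes
```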
The next step is figuring out the error-locations $X_j$.
To help with this, we introduce a very special polynomial, the
"error-locator polynomial" $\Lambda(x)$:

$$\Lambda(x) = \prod_{j \in e} \left(1 - X_j x\right)$$
This polynomial has some rather useful properties:
-
For any error-location $X_j$, $\Lambda(X_j^{-1})$ evaluates to zero:

$$\Lambda(X_j^{-1}) = 0$$

This is for similar reasons why $P(g^i) = 0$. For any error-location $X_j$, the term containing it evaluates to zero:

$$1 - X_j X_j^{-1} = 1 - 1 = 0$$

And since multiplying anything by zero is zero, the product reduces to zero.
-
$\Lambda(0)$ evaluates to one:

$$\Lambda(0) = 1$$

This can be seen by plugging in 0:

$$\Lambda(0) = \prod_{j \in e} \left(1 - X_j \cdot 0\right) = \prod_{j \in e} 1 = 1$$

This prevents trivial solutions and is what makes $\Lambda(x)$ useful.
What's really interesting is that these two properties allow us to
solve for $\Lambda(x)$ given only our syndromes $S_i$.
We know $\Lambda(X_j^{-1})$ evaluates to zero for any error-location $X_j$, even if we expand it into a set of unknown coefficients $\Lambda_k$:

$$\Lambda(X_j^{-1}) = 1 + \sum_{k=1}^{e} \Lambda_k X_j^{-k} = 0$$

Note this doesn't actually change our error-locator $\Lambda(x)$, we're just writing the same polynomial in terms of its (currently unknown) coefficients.

And since multiplying anything by zero is zero, we can multiply this by,
say, $Y_j X_j^i$, and the result should still be zero:

$$Y_j X_j^i \Lambda(X_j^{-1}) = Y_j X_j^i + \sum_{k=1}^{e} \Lambda_k Y_j X_j^{i-k} = 0$$

We can even add a bunch of these together and the result should still be zero:

$$\sum_{j \in e} Y_j X_j^i \Lambda(X_j^{-1}) = \sum_{j \in e} Y_j X_j^i + \sum_{k=1}^{e} \Lambda_k \sum_{j \in e} Y_j X_j^{i-k} = 0$$
Wait a second...
Aren't these our syndromes?
They are! We can rearrange this into an equation for the syndrome $S_i$ in terms of the previous syndromes and the coefficients of $\Lambda(x)$:

$$S_i = \sum_{k=1}^{e} \Lambda_k S_{i-k}$$

If we repeat this for syndromes $S_e$ through $S_{n-1}$, we end up with $n-e$ equations and $e$ unknowns, the coefficients $\Lambda_k$, which we can, in theory, solve for.
This is where the $n = 2e$ requirement comes from: to solve for $e$ unknown coefficients we need at least $e$ of these equations, which means we need at least $2e$ syndromes.
Ok that's the theory, but solving this system of equations efficiently is still quite difficult.
Enter Berlekamp-Massey.
A key observation by Massey is that solving for $\Lambda(x)$ is equivalent to finding the smallest LFSR (linear-feedback shift register) that generates the sequence of syndromes $S_e, S_{e+1}, \dots, S_{n-1}$ given the initial state $S_0, S_1, \dots, S_{e-1}$:
.---- + <- + <- + <- + <--- ... --- + <--.
| ^ ^ ^ ^ ^ |
| *Λ1 *Λ2 *Λ3 *Λ4 ... *Λe-1 *Λe
| ^ ^ ^ ^ ^ ^
| .-|--.-|--.-|--.-|--.-- --.-|--.-|--.
'-> |Se-1|Se-2|Se-3|Se-4| ... | S1 | S0 | -> Sn-1 Sn-2 ... Se+3 Se+2 Se+1 Se Se-1 Se-2 ... S3 S2 S1 S0
'----'----'----'----'-- --'----'----'
Pretty wild huh.
We can describe such an LFSR with a recurrence relation that might look a bit familiar:

$$s_i = \sum_{k=1}^{|L|} L_k s_{i-k}$$
Berlekamp-Massey relies on two key observations:
-
If an LFSR $L(i)$ of size $|L|$ generates the sequence $s_0, s_1, \dots, s_{n-1}$, but fails to generate the sequence $s_0, s_1, \dots, s_{n-1}, s_n$, then an LFSR $L'(i)$ that does generate the sequence must have a size of at least:

$$|L'| \ge n+1-|L|$$

Massey's proof of this is very math heavy.

Consider the equation for our LFSR $L(i)$ at $n$:

$$L(n) = \sum_{k=1}^{|L|} L_k s_{n-k}$$

If we have another LFSR $L'(i)$ that generates $s_{n-|L|}, s_{n-|L|+1}, \cdots, s_{n-1}$, we can substitute it in for $s_{n-k}$:

$$L(n) = \sum_{k=1}^{|L|} L_k \sum_{k'=1}^{|L'|} L'_{k'} s_{n-k-k'}$$

Multiplication is distributive, so we can move our summations around:

$$L(n) = \sum_{k'=1}^{|L'|} L'_{k'} \sum_{k=1}^{|L|} L_k s_{n-k'-k}$$

And note that the right summation looks a lot like $L(i)$. If $L(i)$ generates $s_{n-|L'|}, s_{n-|L'|+1}, \cdots, s_{n-1}$, we can replace it with $s_{n-k'}$:

$$L(n) = \sum_{k'=1}^{|L'|} L'_{k'} s_{n-k'}$$

Oh hey! That's the definition of $L'(i)$:

$$L(n) = L'(n)$$

So if $L'(i)$ generates $s_n$, $L(i)$ must also generate $s_n$. The only way to make $L'(i)$ generate a different $s_n$ would be to make $|L'| \ge n+1-|L|$ so that $L(i)$ can no longer generate $s_{n-|L'|}, s_{n-|L'|+1}, \cdots, s_{n-1}$.
-
Once we've found the best LFSR $L(i)$ for a given size $|L|$, its definition provides an optimal strategy for changing only the last element of the generated sequence.

This is assuming $L(i)$ failed of course. If $L(i)$ generated the whole sequence our algorithm is done!

If $L(i)$ failed, we assume it correctly generated $s_0, s_1, \cdots, s_{n-1}$, but failed at $s_n$. We call the difference from the expected symbol the discrepancy $d$:

$$d = s_n - \sum_{k=1}^{|L|} L_k s_{n-k}$$

If we know $s_i$ (which requires a larger LFSR), we can rearrange this to be a bit more useful. We call this our connection polynomial $C(i)$:

$$C(i) = \frac{1}{d} \left(s_i - \sum_{k=1}^{|L|} L_k s_{i-k}\right)$$

Now, if we have a larger LFSR $L'(i)$ with size $|L'| \gt |L|$ and we want to change only the symbol $s'_n$ by $d'$, we can add $d' \cdot C(i)$, and only $s'_n$ will be affected:

$$L''(i) = L'(i) + d' \cdot C(i)$$
If you can wrap your head around those two observations, you have understood most of Berlekamp-Massey.
The actual algorithm itself is relatively simple:
-
Using the current LFSR
$L(i)$, generate the next symbol and calculate the discrepancy $d$ between it and the expected symbol $s_n$:

$$d = s_n - \sum_{k=1}^{|L|} L_k s_{n-k}$$
If
$d=0$ , great! Move on to the next symbol. -
If
$d \ne 0$ , we need to tweak our LFSR:-
First check if our LFSR is big enough. If
$n \ge 2|L|$ , we need a bigger LFSR:If we're changing the size, save the best LFSR at the current size for future tweaks:
-
Now we can fix the LFSR by adding our last
$C(i)$ (not$C'(i)$ !), shifting and scaling so only$s_n$ is affected:Where
$m$ is the value of$n$ when we saved the last$C(i)$ . If we shift$C(i)$ every step of the algorithm, we usually don't need to track$m$ explicitly.
-
This is all implemented in ramrsbd_find_λ.
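Here's a rough Python sketch of the same idea, using the GF(256) helpers from the earlier sketch. It's not ramrsbd_find_λ itself, and it shifts the connection polynomial every step so we don't need to track m explicitly:

```python
def rs_find_lambda(syndromes):
    # Berlekamp-Massey: find the smallest LFSR, aka the error-locator Λ(x),
    # that generates the syndromes; coefficients are lowest-degree first, so
    # lam[0] is the constant term Λ(0) = 1
    lam = [1] + [0]*len(syndromes)  # current LFSR L(i)
    c = [1] + [0]*len(syndromes)    # last connection polynomial C(i)
    size = 0                        # current LFSR size |L|
    for n, s in enumerate(syndromes):
        # calculate the discrepancy d between the expected symbol and
        # whatever our current LFSR generates
        d = s
        for k in range(1, size+1):
            d ^= gf_mul(lam[k], syndromes[n-k])
        # shift C(i) every step so it stays aligned with s_n
        c = [0] + c[:-1]
        if d != 0:
            if 2*size <= n:
                # LFSR too small, resize it, saving the old LFSR scaled by
                # 1/d as the new connection polynomial
                lam, c = ([l ^ gf_mul(d, x) for l, x in zip(lam, c)],
                          [gf_div(l, d) for l in lam])
                size = n+1 - size
            else:
                # LFSR big enough, just add d*C(i) so s_n is generated correctly
                lam = [l ^ gf_mul(d, x) for l, x in zip(lam, c)]
    return lam[:size+1]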
Taking a step away from GF(256) for a moment, let's look at a simpler LFSR in GF(2), aka binary.
Consider this binary sequence generated by a minimal LFSR that I know and you don't :)
1 1 0 0 1 1 1 1
Can you figure out the original LFSR?
Click here to solve with Berlekamp-Massey
Using Berlekamp-Massey:
To start, let's assume our LFSR is an empty LFSR that just spits out a stream of zeros. Not the most creative LFSR, but we have to start somewhere!
|L0| = 0
L0(i) = 0
C0(i) = s_i
L0 = 0 -> Output: 0
Expected: 1
d = 1
Ok, so immediately we see a discrepancy. Clearly our output is not a string of all zeros, and we need some LFSR:
|L1| = 0+1-|L0| = 1
L1(i) = L0(i) + C0(i-1) = s_i-1
C1(i) = s_i + L0(i) = s_i
.-----.
| |
| .-|--.
L1 = '-> | 1 |-> Output: 1 1
'----' Expected: 1 1
d = 0
That's looking much better. This little LFSR will actually get us decently far into the sequence:
|L2| = |L1| = 1
L2(i) = L1(i) = s_i-1
C2(i) = C1(i-1) = s_i-1
.-----.
| |
| .-|--.
L2 = '-> | 1 |-> Output: 1 1 1
'----' Expected: 1 1 1
d = 0
|L3| = |L2| = 1
L3(i) = L2(i) = s_i-1
C3(i) = C2(i-1) = s_i-2
.-----.
| |
| .-|--.
L3 = '-> | 1 |-> Output: 1 1 1 1
'----' Expected: 1 1 1 1
d = 0
|L4| = |L3| = 1
L4(i) = L3(i) = s_i-1
C4(i) = C3(i-1) = s_i-3
.-----.
| |
| .-|--.
L4 = '-> | 1 |-> Output: 1 1 1 1 1
'----' Expected: 0 1 1 1 1
d = 1
Ah! A discrepancy!
We're now at step 4 with only a 1-bit LFSR.
Resizing our LFSR to 4+1-|L4| = 4 bits lets us fix the discrepancy without changing the already-generated symbols:
|L5| = 4+1-|L4| = 4
L5(i) = L4(i) + C4(i-1) = s_i-1 + s_i-4
C5(i) = s_i + L4(i) = s_i + s_i-1
.---- + <------------.
| ^ |
| .-|--.----.----.-|--. Expected: 0 0 1 1 1 1
L5 = '-> | 1 | 1 | 1 | 1 |-> Output: 1 0 1 1 1 1
'----'----'----'----' d = 1
Another discrepancy! This time we don't need to resize the LFSR, just add
the shifted C5(i-1) to fix the discrepancy:
Thanks to math, we know this should have no effect on any of the previously generated symbols, but feel free to regenerate the sequence to prove this to yourself. This property is pretty unintuitive!
|L6| = |L5| = 4
L6(i) = L5(i) + C5(i-1) = s_i-2 + s_i-4
C6(i) = C5(i-1) = s_i-1 + s_i-2
.--------- + <-------.
| ^ |
| .----.-|--.----.-|--.
L6 = '-> | 1 | 1 | 1 | 1 |-> Output: 1 0 0 1 1 1 1
'----'----'----'----' Expected: 1 0 0 1 1 1 1
d = 0
No discrepancy this time, let's keep going:
|L7| = |L6| = 4
L7(i) = L6(i) = s_i-2 + s_i-4
C7(i) = C6(i-1) = s_i-2 + s_i-3
.--------- + <-------.
| ^ |
| .----.-|--.----.-|--.
L7 = '-> | 1 | 1 | 1 | 1 |-> Output: 1 1 0 0 1 1 1 1
'----'----'----'----' Expected: 1 1 0 0 1 1 1 1
d = 0
And now that we've generated the whole sequence, we have our LFSR:
|L8| = |L7| = 4
L8(i) = L7(i) = s_i-2 + s_i-4
.--------- + <-------.
| ^ |
| .----.-|--.----.-|--.
L8 = '-> | 1 | 1 | 1 | 1 |-> Output: 1 1 0 0 1 1 1 1
'----'----'----'----' Expected: 1 1 0 0 1 1 1 1
In case you want to play around with this algorithm more (and for my own experiments), I've ported this LFSR solver to Python in bm-lfsr-solver.py. Feel free to try your own binary sequences:
$ ./bm-lfsr-solver.py 1 1 0 0 1 1 1 1
... snip ...
.--------- + <-------.
| ^ |
| .----.-|--.----.-|--.
L8 = '-> | 1 | 1 | 1 | 1 |-> Output: 1 1 0 0 1 1 1 1
'----'----'----'----' Expected: 1 1 0 0 1 1 1 1
$ ./bm-lfsr-solver.py 01101000 01101001 00100001
... snip ...
.---- + <---------------- + <- + <- + <------ + <-------.
| ^ ^ ^ ^ ^ |
| .-|--.----.----.----.-|--.-|--.-|--.----.-|--.----.-|--.----.
L24 = '-> | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |-> Output: 0 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1
'----'----'----'----'----'----'----'----'----'----'----'----' Expected: 0 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 0 0 0 0 1
I've also implemented a similar script, bm-lfsr256-solver.py, for full GF(256) LFSRs, though it's a bit harder to follow unless you can do full GF(256) multiplications in your head:
$ ./bm-lfsr256-solver.py 30 80 86 cb a3 78 8e 00
... snip ...
.---- + <- + <- + <--.
| ^ ^ ^ |
| *f0 *04 *df *ea
| ^ ^ ^ ^
| .-|--.-|--.-|--.-|--.
L8 = '-> | a3 | 78 | 8e | 00 |-> Output: 30 80 86 cb a3 78 8e 00
'----'----'----'----' Expected: 30 80 86 cb a3 78 8e 00
Is Berlekamp-Massey a good compression algorithm? Probably not.
Coming back to Reed-Solomon, thanks to Berlekamp-Massey we can solve the
following recurrence for the terms $\Lambda_k$, given our syndromes $S_i$:

$$S_i = \sum_{k=1}^{e} \Lambda_k S_{i-k}$$
These terms define our error-locator polynomial, which we can use to find the locations of errors:

$$\Lambda(x) = 1 + \sum_{k=1}^{e} \Lambda_k x^k$$
All we have left to do is figure out where $\Lambda(X_j^{-1})$ evaluates to zero, which tells us our error-locations $X_j$.

The easiest way to do this is just brute force: Just plug in every
location $X_j = g^j$ in our codeword, and if $\Lambda(X_j^{-1}) = 0$, we know $X_j$ is the location of an error.
Wikipedia and other resources often mention an optimization called
Chien's search being applied here, but from reading up on the
algorithm it seems to only be useful for hardware implementations. In
software Chien's search doesn't actually improve our runtime over brute
force with Horner's method and log tables (both end up $O(ne)$).
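A quick sketch of the brute force search, using Horner's method and the earlier GF(256) helpers (the mapping of byte index to error-location here is an assumed convention, not necessarily ramrsbd's):

```python
def rs_find_error_locations(lam, codeword_size):
    # X_j is an error-location if Λ(X_j^-1) == 0, so just try every location,
    # evaluating Λ(x) with Horner's method (lam is lowest-degree first)
    locations = []
    for j in range(codeword_size):
        xj_inv = GF_POW[(-j) % 255]  # X_j^-1 = g^-j
        y = 0
        for k in reversed(range(len(lam))):
            y = gf_mul(y, xj_inv) ^ lam[k]
        if y == 0:
            locations.append(j)
    return locations
```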
Once we've found the error-locations $X_j$, the next step is figuring out the error-magnitudes $Y_j$.
This step is relatively straightforward... sort of...
Recall the definition of our syndromes $S_i$:

$$S_i = \sum_{j \in e} Y_j X_j^i$$

With the error-locations $X_j$ known, this gives us a system of equations where the only unknowns are our error-magnitudes $Y_j$, which we can, in theory, solve for.
But again, solving this system of equations is easier said than done.
Enter Forney's algorithm.
Assuming we know an error-locator polynomial $\Lambda(x)$, we can jump straight to the answer, Forney's formula:

$$Y_j = X_j \frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})}$$

Where $\Omega(x)$, the "error-evaluator polynomial", is defined in terms of the syndrome polynomial $S(x)$ and the error-locator polynomial $\Lambda(x)$:

$$\Omega(x) = S(x) \Lambda(x) \bmod x^n$$

And $\Lambda'(x)$ is the formal derivative of the error-locator polynomial:

$$\Lambda'(x) = \sum_{k=1}^{e} k \cdot \Lambda_k x^{k-1}$$

Though note $k$ is not a field element, so multiplication by $k$ here means repeated addition in the field.

The end result is a simple formula for our error-magnitudes $Y_j$.
Haha, I know right? Where did this equation come from? Why does it work? How did Forney even come up with this?
I don't know the answer to most of these questions, there's very little information online about where/how/what this formula comes from.
But at the very least we can prove that it does work.
Let's start with the syndrome polynomial $S(x)$:

$$S(x) = \sum_{i=0}^{n-1} S_i x^i$$

Substituting in the definition of our syndromes $S_i = \sum_{j \in e} Y_j X_j^i$:

$$S(x) = \sum_{i=0}^{n-1} \sum_{j \in e} Y_j X_j^i x^i = \sum_{j \in e} Y_j \sum_{i=0}^{n-1} \left(X_j x\right)^i$$

The sum on the right turns out to be a geometric series:

$$S(x) = \sum_{j \in e} Y_j \frac{1 - \left(X_j x\right)^n}{1 - X_j x}$$

If we then multiply with our error-locator polynomial $\Lambda(x)$:

$$S(x) \Lambda(x) = \sum_{j \in e} Y_j \frac{1 - \left(X_j x\right)^n}{1 - X_j x} \cdot \prod_{j' \in e} \left(1 - X_{j'} x\right)$$

We see exactly one term in each summand cancel out, the term where $j = j'$:

$$S(x) \Lambda(x) = \sum_{j \in e} Y_j \left(1 - \left(X_j x\right)^n\right) \prod_{j' \ne j} \left(1 - X_{j'} x\right)$$
At this point, if we plug in $X_j^{-1}$ you might expect everything but the $j$-th term to drop out, but that pesky $\left(1 - \left(X_j x\right)^n\right)$ factor gets in the way.

But if we expand the multiplication, something interesting happens:

$$S(x) \Lambda(x) = \sum_{j \in e} Y_j \prod_{j' \ne j} \left(1 - X_{j'} x\right) - \sum_{j \in e} Y_j \left(X_j x\right)^n \prod_{j' \ne j} \left(1 - X_{j'} x\right)$$

On the left side of the subtraction, all terms are of degree $e-1$ or less, while on the right side, all terms are of degree $n$ or greater.

Imagine how these both contribute to the expanded form of the equation: the low-degree coefficients come entirely from the left side of the subtraction, and everything of degree $n$ or higher comes entirely from the right side.

If we truncate this polynomial, $\bmod x^n$, the right side of the subtraction disappears entirely.

Giving us an equation for the error-evaluator polynomial, $\Omega(x)$:

$$\Omega(x) = S(x) \Lambda(x) \bmod x^n = \sum_{j \in e} Y_j \prod_{j' \ne j} \left(1 - X_{j'} x\right)$$
Note that since $e \le \frac{n}{2}$, the truncation doesn't remove any of the terms we actually care about.

The error-evaluator polynomial $\Omega(x)$ gets really interesting when we plug in an error-location inverse $X_j^{-1}$, since all but one of the summands contain a $\left(1 - X_j X_j^{-1}\right) = 0$ factor and drop out:

$$\Omega(X_j^{-1}) = Y_j \prod_{j' \ne j} \left(1 - X_{j'} X_j^{-1}\right)$$

And right there is our error-magnitude $Y_j$! Multiplied by a bit of gobbledygook, but it's in there.

The good news is that gobbledygook depends only on our error-locations $X_j$, which we already know.

But Forney has one last trick up his sleeve: The
formal derivative of the error-locator polynomial, $\Lambda'(x)$, turns out to contain the exact same gobbledygook.
What the heck is a formal derivative?
Well we can't use normal derivatives in a finite-field like GF(256) because they depend on the notion of a limit which depends on the field being, well, not finite.
But derivatives are so useful mathematicians use them anyways.
Applying a formal derivative looks a lot like a normal derivative in normal math:

$$f(x) = \sum_{i=0}^{n} f_i x^i \quad \rightarrow \quad f'(x) = \sum_{i=1}^{n} i \cdot f_i x^{i-1}$$

Except $i$ here is not a field element, so instead of normal multiplication, multiplying by $i$ means repeated addition in the field.

Quite a few properties of derivatives still hold in finite-fields. Of particular interest to us is the product rule:

$$\left(f(x) g(x)\right)' = f'(x) g(x) + f(x) g'(x)$$
Applying this to our error-locator polynomial $\Lambda(x)$, the derivative of a big product expands into a sum over each factor's derivative:

$$\left(\prod_{j} f_j(x)\right)' = \sum_{j} f_j'(x) \prod_{j' \ne j} f_{j'}(x)$$

Recall the other definition of our error-locator polynomial $\Lambda(x)$:

$$\Lambda(x) = \prod_{j \in e} \left(1 - X_j x\right)$$

Applying the product rule:

$$\Lambda'(x) = \sum_{j \in e} -X_j \prod_{j' \ne j} \left(1 - X_{j'} x\right)$$
Starting to look familiar?
Just like the error-evaluator polynomial $\Omega(x)$, plugging in an error-location inverse $X_j^{-1}$ causes all but one of the summands to drop out.

So for a given error-location $X_j$, the error-evaluator polynomial gives us:

$$\Omega(X_j^{-1}) = Y_j \prod_{j' \ne j} \left(1 - X_{j'} X_j^{-1}\right)$$

And the formal derivative of the error-locator polynomial gives us:

$$\Lambda'(X_j^{-1}) = -X_j \prod_{j' \ne j} \left(1 - X_{j'} X_j^{-1}\right)$$

If we divide $\Omega(X_j^{-1})$ by $\Lambda'(X_j^{-1})$, all of that gobbledygook cancels out:

$$\frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})} = \frac{Y_j}{-X_j}$$

All that's left is to cancel out the $-X_j^{-1}$ term, and since negation is a no-op in GF(256), we end up with a formula for our error-magnitude $Y_j$:

$$Y_j = X_j \frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})}$$
Once we've figured out the error-locator polynomial $\Lambda(x)$, the error-evaluator polynomial $\Omega(x)$, and the formal derivative of the error-locator polynomial $\Lambda'(x)$, we have everything we need to start fixing errors.

For each location $j$ in the codeword where $\Lambda(X_j^{-1}) = 0$, calculate the error-magnitude:

$$Y_j = X_j \frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})}$$

To fix the error, plug the error-location $X_j$ and error-magnitude $Y_j$ back into the error polynomial $E(x)$ and subtract (xor) it from the received codeword:

$$C(x) = C'(x) - Y_j x^j$$

Repeat for all errors in the malformed codeword.
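Putting Forney's algorithm together with the earlier sketches, repairing a codeword might look something like this. Again, this is a sketch with assumed conventions (lowest-degree-first polynomials, codeword bytes stored highest degree first), not ramrsbd's actual code:

```python
def rs_poly_eval(p, x):
    # evaluate a lowest-degree-first polynomial with Horner's method
    y = 0
    for c in reversed(p):
        y = gf_mul(y, x) ^ c
    return y

def rs_correct(codeword, syndromes, lam):
    n = len(syndromes)
    # error-evaluator polynomial Ω(x) = S(x)Λ(x) mod x^n
    omega = [0]*n
    for i in range(n):
        for k in range(len(lam)):
            if i+k < n:
                omega[i+k] ^= gf_mul(syndromes[i], lam[k])
    # formal derivative Λ'(x); k·Λ_k is repeated addition, so in GF(256)
    # only the odd-k terms survive
    dlam = [lam[k] if k % 2 == 1 else 0 for k in range(1, len(lam))]
    # find each error and repair it with Forney's formula
    for j in range(len(codeword)):
        xj, xj_inv = GF_POW[j % 255], GF_POW[(-j) % 255]
        if rs_poly_eval(lam, xj_inv) == 0:
            yj = gf_mul(xj, gf_div(rs_poly_eval(omega, xj_inv),
                                   rs_poly_eval(dlam, xj_inv)))
            # the x^j term is the j-th byte from the end of the codeword
            codeword[len(codeword)-1-j] ^= yj
    return codeword
```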
Unfortunately we're not quite done yet. All of this math assumed we had at most $e \le \left\lfloor\frac{n}{2}\right\rfloor$ byte-errors. If that assumption doesn't hold, all bets are off.
It's worth recalculating the syndromes after fixing errors to see if we actually ended up with a valid codeword:

$$S_i = C(g^i) = 0$$
If the syndromes are all zero, chances are high we successfully repaired our codeword.
Unless of course we had enough errors to end up overcorrecting to a different codeword, but there's not much we can do in that case. No error-correction is perfect.
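If you're following along with the Python sketches, this check is just another syndrome pass:

```python
def rs_is_valid(codeword, n):
    # a repaired codeword is valid only if all of its syndromes are zero
    return not any(rs_syndromes(codeword, n))
```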
This is all implemented in ramrsbd_read, if you're curious what it looks like in code.
Heavy math aside, there are a couple minor implementation tricks worth noting:
-
Truncate the generator polynomial to
ecc_size
.Calculating the generator polynomial for
$n$ bytes of ECC gives us a polynomial with $n+1$ terms. This is a bit annoying since ecc_size is often a power-of-two.

Fortunately, because of math reasons, the first term will always be 1. So, just like with our CRC polynomials, we can leave off the leading 1 and make it implicit.
Division with an implicit 1 is implemented in
ramrsbd_gf_p_divmod1
, which has the extra benefit of being able to skip the normalization step during synthetic division, so that's nice. -
Store the generator polynomial in ROM.
We don't really need to recompute the generator polynomial every time we initialize the block device, it's just convenient API-wise when we don't know
ecc_size
.If you only need to support a fixed set of block device geometries, precomputing and storing the generator polynomial in ROM will save a couple (
ecc_size
) bytes of RAM.rs-poly.py can help generate the generator polynomial:
$ ./rs-poly.py 8
// generator polynomial for ecc_size=8
//
// P(x) = prod_i^n-1 (x - g^i)
//
static const uint8_t RAMRSBD_P[8] = {
    0xff, 0x0b, 0x51, 0x36, 0xef, 0xad, 0xc8, 0x18,
};
Which can then be provided to ramrsbd's
p
config option to avoid allocating thep_buffer
in RAM.Unfortunately we still pay the code cost for generating the generator polynomial, which is difficult to avoid with the current API.
-
Minimizing the number of polynomial buffers.
We have quite a few polynomials flying around:
-
$C(x)$ - Codeword buffer -code_size
bytes -
$P(x)$ Generator polynomial -ecc_size
bytes (truncated) -
$S_i$ - Syndrome buffer -ecc_size
bytes -
$\Lambda(x)$ - Error-locator polynomial -ecc_size/2+1
(<=ecc_size
) bytes -
$C(i)$ - Connection polynomial -ecc_size/2+1
(<=ecc_size
) bytes -
$\Omega(x)$ - Error-evaluator polynomial -ecc_size/2
(<=ecc_size
) bytes -
$\Lambda'(x)$ - Derivative of the error-locator polynomial -ecc_size/2
(<=ecc_size
) bytes
These get a bit annoying in a malloc-less system.
Fortunately, there are a couple places we can reuse buffers:
-
The connection polynomial
$C(i)$ is only needed for Berlekamp-Massey and we can throw it away as soon as the error-locator$\Lambda(x)$ is found. -
We only need the syndrome buffer
$S_i$ to find$\Lambda(x)$ ,$\Omega(x)$ , and$\Lambda'(x)$ . After that we have everything we need to start fixing errors.
By sharing these buffers with polynomials computed later, such as the error-evaluator
$\Omega(x)$ and derivative of the error-locator$\Lambda'(x)$ , we can reduce the total number of buffers needed down to 1code_size
buffer and 4ecc_size
buffers (3 if the generator polynomial is stored in ROM). -
-
Fused derivative evaluation.
The formal derivative is a funny operation:

$$\Lambda'(x) = \sum_{k=1}^{e} k \cdot \Lambda_k x^{k-1}$$
Unlike other steps, it's a simple transformation that only affects one term at a time. This makes it easy to apply lazily without needing to copy the original polynomial.
We already need to keep the error-locator polynomial
$\Lambda(x)$ around to, you know, locate the errors. So if we merge the derivative and evaluation of the error-locator into a single operation, we don't actually need to store the derivative of the error-locator $\Lambda'(x)$ as a separate polynomial. In theory reducing the number of buffers needed during error-evaluation from 3 down to 2.

This sort of fused derivative evaluation is implemented in ramrsbd_gf_p_deval via a modified Horner's method (there's a rough sketch of the idea just after this list).

Unfortunately, we still need at least 3 buffers for Berlekamp-Massey ($S_i$, $\Lambda(i)$, and $C(i)$), so this doesn't actually save us anything.

But it doesn't really cost us anything either, cleans up the code a little bit, and lets us avoid clobbering the syndrome buffer
$S_i$ which is useful for debugging.
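A rough sketch of the fused derivative evaluation idea, in Python rather than the actual ramrsbd_gf_p_deval, and again assuming the gf_mul helper from the earlier GF(256) sketch:

```python
def gf_p_deval(p, x):
    # evaluate the formal derivative p'(x) directly from p with a modified
    # Horner's method, never materializing p' as a separate polynomial;
    # p is lowest-degree first, and k*p_k keeps only the odd-k terms in GF(256)
    y = 0
    for k in reversed(range(1, len(p))):
        y = gf_mul(y, x) ^ (p[k] if k % 2 == 1 else 0)
    return y
```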
And some caveats:
-
For any error-correcting code, attempting to correct errors reduces the code's ability to detect errors.
In Reed-Solomon's case, for
$n$ bytes of ECC, we can detect up to $n$ byte-errors, but only correct up to $\left\lfloor\frac{n}{2}\right\rfloor$ byte-errors. Attempting to correct more errors can cause us to end up with a valid, but wrong, codeword.

In practice this isn't that big of a problem. Fewer byte-errors are more common, and correcting byte-errors is usually more useful. At $n+1$ byte-errors you're going to end up with undetectable errors anyways.

Still, it's good to be aware of this tradeoff.
ramrsbd's
error_correction
config option lets you control exactly how many byte-errors to attempt to repair in case better detection is more useful. -
Limited to 255 byte codewords - the non-zero elements of GF(256).
An important step in Reed-Solomon is mapping each possible error location to a non-zero element of our finite field
$X_j=g^j$. Unfortunately our finite-field is, well, finite, so there's only so many non-zero elements we can use before error-locations start to alias.

This gives us a maximum codeword size of 255 bytes in GF(256), including the bytes used for ECC. A bit annoying, but math is math.
In theory you can increase the maximum codeword size by using a larger finite-field, but this gets a bit tricky because the log/pow table approach used in ramrsbd stops being practical. 512 bytes of tables for GF(256) is fine, but 128 KiBs of tables for GF(2^16)? Not so much...
-
If you have carryless-multiplication hardware available, GF(2^n) multiplication can be implemented efficiently by combining multiplication and Barrett reduction.
Division can then be implemented on top of multiplication by leveraging the fact that
$a^{2^n-2} = a^{-1}$ for any non-zero element $a$ in GF(2^n). Binary exponentiation can make this somewhat efficient.
In the same way GF(256) is defined as an extension field of GF(2), we can define GF(2^16) as an extension field of GF(256), where each element is a 2 byte polynomial containing digits in GF(256).
This can be convenient if you already need GF(256) tables for other parts of the codebase.
Or, a simpler alternative, you can just pack multiple "physical" codewords into one "logical" codeword.
You could even consider interleaving the physical codewords if you want to maintain the systematic encoding or are trying to protect against specific error patterns.
-
Support for known-location "erasures" left as an exercise for the reader.
All of the above math assumes we don't know the location of errors, which is the most common case for block devices.
But it turns out if we do know the location of errors, via parity bits or some other side-channel, we can do quite a bit better. We usually call these known-location errors "erasures".
With Reed-Solomon, each unknown-location error requires 2 bytes of ECC to find and repair, while known-location erasures require only 1 byte of ECC to repair. You can even mix and match
$e$ errors and $f$ erasures as long as you have $n$ bytes of ECC such that:

$$2e + f \le n$$

This isn't implemented in ramrsbd, but, in theory, the math isn't too difficult to extend.
First note we can split
$\Lambda(x)$ into a separate error-locator polynomial $\Lambda_E(x)$ and erasure-locator polynomial $\Lambda_F(x)$:

$$\Lambda(x) = \Lambda_E(x) \Lambda_F(x)$$

We know the location of the known-location erasures, so the erasure-locator
$\Lambda_F(x)$ is trivial to calculate:

$$\Lambda_F(x) = \prod_{j \in f} \left(1 - X_j x\right)$$

Before we can find the error-locator polynomial
$\Lambda_E(x)$, we need to modify our syndromes to hide the effects of the erasure-locator polynomial. These are often called the Forney syndromes $S_{Fi}$:

$$S_F(x) = S(x) \Lambda_F(x) \bmod x^n$$

Note that the Forney syndromes
$S_{Fi}$ still satisfy the equation for $\Omega(x)$:

$$\Omega(x) = S_F(x) \Lambda_E(x) \bmod x^n$$

We can then use Berlekamp-Massey with the Forney syndromes
$S_{Fi}$ to find the error-locator polynomial $\Lambda_E(x)$.

Combining the error-locator polynomial
$\Lambda_E(x)$ and the erasure-locator polynomial $\Lambda_F(x)$ gives us the creatively named error-and-erasure-locator polynomial $\Lambda(x)$, which contains everything we need to know to find the location of both errors and erasures:

$$\Lambda(x) = \Lambda_E(x) \Lambda_F(x)$$

At this point we can continue Reed-Solomon as normal, finding the error and erasure locations where
$\Lambda(X_j^{-1})=0$, and repairing them with Forney's algorithm, $Y_j = X_j \frac{\Omega(X_j^{-1})}{\Lambda'(X_j^{-1})}$.