Wonyeong Jung
- Email: [email protected]
Dongwhee Kim
Paper title: [ISOCC'23] Synergistic Integration: An Optimal Combination of On-Die and Rank-Level ECC for Enhanced Reliability
Paper URL: https://ieeexplore.ieee.org/document/10396592
- Reading OD-ECC H-Matrix.txt
- Setting the output file name: output.S
- (Start loop) DDR5 ECC-DIMM setup
- Initialize all data in 10 chips to 0: Each chip has 136 bits of data + redundancy
- Error injection: Scenario-based Error injection
- Apply OD-ECC: Implementation
Apply the (136, 128) Hamming SEC code to each chip.
After OD-ECC runs, its redundancy bits stay inside the chip; only the 128-bit data leaves the chip.
- Apply RL-ECC
A burst of length 16 (BL16) forms one memory transfer block (64 B cacheline + 16 B redundancy).
In DDR5 x4 DRAM, because of internal prefetching, only 64 bits of each chip's 128-bit data are actually transferred to the cache.
To account for this, create two memory transfer blocks from the 128-bit data and compare them.
- Report CE/DUE/SDC results.
- (End loop) Derive final results.
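The block split and result reporting in the flow above can be sketched as follows. This is a minimal illustration, not the simulator's code. The names `split_into_blocks`, `classify`, `combine`, and the `Outcome` enum are hypothetical. The sketch assumes that the first 64 bits of each chip form the first transfer block, that an extra "no error" (NE) outcome exists, and that the overall result of an iteration is the worse of the two per-block results. Because all data is initialized to 0, any non-zero bit left after decoding means wrong data.

```python
from enum import IntEnum

# Values taken from the DIMM configuration in this README.
NUM_CHIPS = 8 + 2            # data chips + parity chips
DQ_PER_CHIP = 4              # x4 chip
BURST_LENGTH = 16
CHIP_DATA_BITS = 128         # bits leaving each chip after OD-ECC

BEAT_BITS = NUM_CHIPS * DQ_PER_CHIP                    # 40 bits per beat
BLOCK_BITS = BEAT_BITS * BURST_LENGTH                  # 640 bits = 80 B
BITS_PER_CHIP_PER_BLOCK = DQ_PER_CHIP * BURST_LENGTH   # 64 bits


class Outcome(IntEnum):
    """Ordered from best to worst (ordering is an assumption)."""
    NE = 0   # no error (hypothetical extra outcome)
    CE = 1   # corrected error
    DUE = 2  # detected but uncorrectable error
    SDC = 3  # silent data corruption


def split_into_blocks(chips):
    """Split per-chip 128-bit data into two transfer blocks of 64 bits per chip."""
    first = [c[:BITS_PER_CHIP_PER_BLOCK] for c in chips]
    second = [c[BITS_PER_CHIP_PER_BLOCK:] for c in chips]
    return first, second


def classify(had_error, decoded_block, flagged_uncorrectable):
    """Classify one block after RL-ECC decoding (the expected data is all zeros)."""
    if flagged_uncorrectable:
        return Outcome.DUE
    if any(bit for chip in decoded_block for bit in chip):
        return Outcome.SDC   # wrong data that the decoder did not flag
    return Outcome.CE if had_error else Outcome.NE


def combine(first_result, second_result):
    """Assumed rule: the iteration result is the worse of the two blocks."""
    return max(first_result, second_result)
```

For example, `combine(Outcome.CE, Outcome.DUE)` yields `Outcome.DUE`, so one uncorrectable block is enough to mark the whole 128-bit access as a DUE.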
- Num of rank: 1
- Beat length: 40 bit
- Burst length: 16
- Num of data chips: 8
- Num of parity chips: 2
- Num of DQ: 4 (x4 chip)
- OD-ECC: (136, 128) Hamming SEC code [1] or SEC code with bounded faults [2]
- RL-ECC: Chipkill-correct ECC using RS (Reed-Solomon) code [3]
Applying Restrained mode [4]
- Detailed examples are in Configuration (ECC, Error pattern).pptx
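To illustrate how a (136, 128) SEC code behaves, here is a minimal syndrome-decoding sketch. The H-matrix built by `build_h_columns` is a placeholder (any 136 distinct non-zero 8-bit columns give a SEC code); it is not the matrix stored in OD-ECC H-Matrix.txt. The function names and the status strings are hypothetical.

```python
def build_h_columns(n=136, r=8):
    """Placeholder H-matrix: one distinct non-zero r-bit column per codeword bit."""
    parity_cols = [1 << i for i in range(r)]                   # check-bit columns
    data_cols = [v for v in range(1, 1 << r) if v & (v - 1)]   # not a power of two
    cols = data_cols[: n - r] + parity_cols
    assert len(cols) == n and len(set(cols)) == n
    return cols


def syndrome(codeword, cols):
    """XOR together the H-matrix columns of all bits that are set."""
    s = 0
    for bit, col in zip(codeword, cols):
        if bit:
            s ^= col
    return s


def sec_decode(codeword, cols, data_bits=128):
    """Return (data, status). Only the data bits leave the chip."""
    s = syndrome(codeword, cols)
    word = list(codeword)
    if s == 0:
        status = "clean"
    elif s in cols:
        word[cols.index(s)] ^= 1      # flip the bit whose column matches
        status = "corrected"
    else:
        status = "unmatched"          # no column matches; word left as-is
    return word[:data_bits], status


cols = build_h_columns()
single = [0] * 136
single[5] = 1
print(sec_decode(single, cols)[1])    # single-bit error is corrected

double = [0] * 136
double[0] = double[1] = 1
data, status = sec_decode(double, cols)
print(status, sum(data))              # miscorrection adds a third wrong bit
```

With this placeholder matrix, the double-bit error produces a syndrome that matches another column, so the decoder flips a third bit. This kind of miscorrection is why the error patterns below matter for the combined OD-ECC and RL-ECC result.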
- SE: per-chip Single bit Error
- MBBE: per-chip Multi-bit Bounded Error
- SW: per-chip Single Word Error (all 4 bits)
- SP: per-chip Single Pin Error (More than 2 bits)
- CHIPKILL: per-chip error (all bits random, each bit flips with 50% probability)
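The error patterns above could be injected into one chip roughly as follows. This is a sketch under stated assumptions, not the simulator's implementation. It assumes the 136 bits of a chip are laid out as 34 beats of 4 pins (bit index = beat * 4 + pin), that the MBBE bound is an aligned window of `bound_bits` bits (a hypothetical parameter; see the .pptx for the real bound), and that "more than 2 bits" for SP means at least 3 bits on one pin. The function name `inject` is hypothetical.

```python
import random

CHIP_BITS = 136                      # data + OD-ECC redundancy per chip
DQ_PER_CHIP = 4                      # x4 chip
BEATS = CHIP_BITS // DQ_PER_CHIP     # 34 beats in the assumed layout


def inject(chip, pattern, rng, bound_bits=16):
    """Flip bits of `chip` (list of 0/1) in place; return the flipped indices."""
    if pattern == "SE":
        flips = [rng.randrange(CHIP_BITS)]
    elif pattern == "MBBE":
        # Assumed bound: one aligned window (windows cover the first 128 bits only).
        start = rng.randrange(CHIP_BITS // bound_bits) * bound_bits
        count = rng.randint(2, bound_bits)
        flips = rng.sample(range(start, start + bound_bits), count)
    elif pattern == "SW":
        beat = rng.randrange(BEATS)  # all 4 bits of one beat
        flips = [beat * DQ_PER_CHIP + pin for pin in range(DQ_PER_CHIP)]
    elif pattern == "SP":
        pin = rng.randrange(DQ_PER_CHIP)
        count = rng.randint(3, BEATS)  # assumed reading of "more than 2 bits"
        flips = [b * DQ_PER_CHIP + pin for b in rng.sample(range(BEATS), count)]
    elif pattern == "CHIPKILL":
        flips = [i for i in range(CHIP_BITS) if rng.random() < 0.5]
    else:
        raise ValueError(f"unknown error pattern: {pattern}")
    for i in flips:
        chip[i] ^= 1
    return sorted(flips)


rng = random.Random(0)               # fixed seed for reproducible runs
chip = [0] * CHIP_BITS
flipped = inject(chip, "SW", rng)
print(flipped)                       # four consecutive indices of one beat
```

A scenario that combines patterns across chips (for example, SE on one chip and CHIPKILL on another) would call `inject` once per affected chip.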
- $ make clean
- $ make
- $ python run.py
- [1] Hamming, Richard W. "Error detecting and error correcting codes." The Bell system technical journal 29.2 (1950): 147-160.
- [2] Criss, Kjersten, et al. "Improving memory reliability by bounding DRAM faults: DDR5 improved reliability features." The International Symposium on Memory Systems. 2020.
- [3] Reed, Irving S., and Gustave Solomon. "Polynomial codes over certain finite fields." Journal of the society for industrial and applied mathematics 8.2 (1960): 300-304.
- [4] Kim, Dongwhee, et al. "Unity ECC: Unified Memory Protection Against Bit and Chip Errors." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2023.