Skip to content

Latest commit

 

History

History
73 lines (56 loc) · 3.09 KB

README.md

File metadata and controls

73 lines (56 loc) · 3.09 KB

ISOCC'23-Fault bound

This project is licensed under the terms of the MIT license

Author

Wonyeong Jung

Dongwhee Kim

Paper title: [ISOCC'23] Synergistic Integration: An Optimal Combination of On-Die and Rank-Level ECC for Enhanced Reliability

Paper URL: (to be updated later)

Overview

An Overview of the Fault_bound

Code flows (Fault_sim.cpp)

    1. Reading OD-ECC H-Matrix.txt
    1. Setting output function name: output.S file
    1. (Start loop) DDR5 ECC-DIMM setup
    1. Initialize all data in 10 chips to 0: Each chip has 136 bits of data + redundancy
    1. Error injection: Scenario-based Error injection
    1. Apply OD-ECC: Implementation

Apply the Hamming SEC code of (136, 128) to each chip

After running OD-ECC, the redundancy of OD-ECC does not come out of the chip (128bit data).

    1. Apply RL-ECC

16 Burst Length (BL) creates one memory transfer block (64B cacheline + 16B redundancy).

In DDR5 x4 DRAM, because of internal prefetching, only 64bit of data from each chip's 128bit data is actually transferred to the cache.

For this, create two memory transfer blocks for 128-bit data and compare them.

    1. Report CE/DUE/SDC results.
    1. (End loop) Derive final results.

DIMM configuration (per-sub channel)

  • DDR5 ECC-DIMM
  • Num of rank: 1
  • Beat length: 40 bit
  • Burst length: 16
  • Num of data chips: 8
  • Num of parity chips: 2
  • Num of DQ: 4 (x4 chip)

ECC configuration

  • OD-ECC: (136, 128) Hamming SEC code [1] 'or' SEC code with bounded_Fault [2]
  • RL-ECC: Chipkill-correct ECC using RS (Reed-Solomon) code [3]

Applying Restrained mode [4]

  • Detail examples are found in Configuration (ECC, Error pattern).pptx

Error pattern configuration (2 chip errors)

  • SE: per-chip Single bit Error
  • MBBE: per-chip Muli bit Bounded Error
  • SW: per-chip Single Word Error (all 4 bits)
  • SP: per-chip Single Pin Error (More than 2 bits)
  • CHIPKILL: per-chip error (All random, bit flips for 50%)

Getting Started

  • $ make clean
  • $ make
  • $ python run.py

References

  • [1] Hamming, Richard W. "Error detecting and error correcting codes." The Bell system technical journal 29.2 (1950): 147-160.
  • [2] Criss, Kjersten, et al. "Improving memory reliability by bounding DRAM faults: DDR5 improved reliability features." The International Symposium on Memory Systems. 2020.
  • [3] Reed, Irving S., and Gustave Solomon. "Polynomial codes over certain finite fields." Journal of the society for industrial and applied mathematics 8.2 (1960): 300-304.
  • [4] Kim, Dongwhee, et al. "Unity ECC: Unified Memory Protection Against Bit and Chip Errors." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2023.