ECC for Reliable DRAM Operation using Spare Columns
Joon-Sung Yang (js.yang@skku.edu)
Department of Semiconductor Systems Engineering, Sungkyunkwan University
2018 Test Technology Workshop (2018.10.23)


DATES Lab

Page 2: Low Power Operation

- Era of Machine Learning / Big Data / IoT
  - Convergence
    - Computing, Communication Networks, Data Storage

Page 3: Data Centric

Page 4: Data Centric → Data Ecosystem

- Data Initiation
  - Personal Devices
  - Sensor Nodes
- Data Transportation
  - Connected Devices
  - Network
- Big Data Processing
  - Data Center, Servers
  - Connected Information

Page 5: Operating Condition

- Extreme Conditions
  - Temperature, Humidity, Vibration
  - Battery Powered (Low Power)
- Connected Network
  - Fast Data Handling
  - Between Devices
- High Performance Computing
  - Extreme Data Volume
  - Time Critical
  - High Reliability

Base of System: H/W Platform → RELIABILITY

Page 6: High Performance Computing

- Data Centric Operation
  - Data Processing Focused Operation
  - Need for Reliable and Energy Efficient Memory Operation

Page 7: Reliability / Dependable Computing

- Logic
  - Time / Space Multiplexing
    - Extreme Reliability
  - Rollback
    - Ex) Razor
  - Worst Case Vector
    - Exercising the Weakest Area

Page 8: Reliability / Dependable Computing

- Memory
  - Scrubbing
  - ECC (Error Correcting Code)
    - Various Codes Available
    - Most Common
    - Message + Check-bit = Codeword

Page 9: Reliable Memory Operation

- ECC
  - Chipkill by IBM
    - Similar to RAID for Disk Subsystems
    - When Writing Data to a DIMM, a Duplicated Set of Data (as Checksum) Is Written to Another Part of the Memory Subsystem

Source: Dell Technology Brief

Page 10: Reliable Memory Operation

- ECC
  - Chipkill by IBM
    - On a Memory Failure, Data Is Recovered by Re-calculation using Checksum Information
    - Corrects 1-, 2-, 3-, and 4-Bit Errors
    - Can Be Used in Mission-Critical Applications
    - Used in the NASA Mars Pathfinder Probe

Source: http://www-05.ibm.com/hu/termekismertetok/xseries/dn/chipkill.pdf

Page 11: Reliable Memory Operation

- Dynamic Refresh Rate Control
  - RAIDR (Retention-Aware Intelligent DRAM Refresh)
    - Increasing Refresh Rate for Weak Rows
    - Decreasing Refresh Rate for Strong Rows
    - 16.1% Power Reduction (32GB DRAM)
    - Cons: Does Not Recover Transient Errors, Profiling Overhead

Source: "RAIDR: Retention-Aware Intelligent DRAM Refresh" by Jamie Liu (CMU), ISCA'12
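The retention-aware idea above can be sketched in a few lines of Python. This is only an illustration: the bin values (64/128/256 ms), the function name `pick_refresh_interval`, and the sample retention times are assumptions made for the example and do not come from the slide.

```python
# Hypothetical refresh-interval bins in milliseconds (assumed values).
REFRESH_BINS_MS = (64, 128, 256)


def pick_refresh_interval(retention_ms: float) -> int:
    """Return the longest assumed bin that does not exceed the retention time.

    Weak rows (short retention) end up with the shortest interval, i.e. the
    highest refresh rate; strong rows get a longer interval.
    """
    chosen = REFRESH_BINS_MS[0]  # fall back to the fastest refresh rate
    for interval in REFRESH_BINS_MS:
        if interval <= retention_ms:
            chosen = interval
    return chosen


# Made-up profiling results: row index -> measured retention time (ms).
profiled = {0: 70.0, 1: 300.0, 2: 150.0}
schedule = {row: pick_refresh_interval(t) for row, t in profiled.items()}
```

The profiling step that produces `profiled` is where the overhead listed under "Cons" would arise; the sketch does nothing about transient errors.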

Page 12: Fault Tolerance

- Ability of a System to Continue Error-free Operation in the Presence of Unexpected Faults
- Fault
  - Permanent Fault
    - Hard Fault / Hard Error
    - Results from Manufacturing Defects
    - Early Life Failures
  - Temporary Fault
    - Soft Error
      - Present for a Short Time
    - Transient (non-recurring) Error (Noise, Power)
    - Intermittent (recurring) Error (Timing)

Page 13: Reliable Memory Operation

- Reliability vs. Power
  - Reliability
    - Add Check-bits
      - Area Overhead
      - High Power Consumption
    - Increase Vdd
    - Increase Refresh Rate
  - Reliability Is in an Inverse Relationship with Power (More Reliability Costs More Power)

Page 14: Memory Reliability

- Technology Scaling
  - DRAM
    - Higher Soft Error Rate
    - Single Bit Upset -> Multiple Bit Upset
    - Need for Error Correction

Page 15: Linear Block Code

- Linear Code
  - Any Linear Combination (XOR) of Codewords -> Codeword
- Block Code
  - Same Length of Codewords
  - Block Size based Decoding
- Linear Block Code
  - Any Linear Combination (XOR) of Any Two Blocks -> Another Block
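The linearity property can be checked by brute force on a small code. The sketch below uses a (4, 3) even-parity code as a stand-in example; the helper names `encode_even_parity` and `xor_words` are made up for this illustration.

```python
from itertools import product


def encode_even_parity(message):
    """Append one even-parity bit to a tuple of message bits."""
    return message + (sum(message) % 2,)


def xor_words(a, b):
    """Bitwise XOR of two equal-length bit tuples."""
    return tuple(x ^ y for x, y in zip(a, b))


# All 2^3 codewords of the (4, 3) single-parity block code.
codewords = {encode_even_parity(m) for m in product((0, 1), repeat=3)}

# Linearity: the XOR of any two codewords is again a codeword.
closed_under_xor = all(
    xor_words(a, b) in codewords for a in codewords for b in codewords
)
```

Every codeword here has the same length (block code) and the set is closed under XOR (linear code), so it is a linear block code in the sense of the slide.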

Page 16: Linear Block Code

- Separable Codeword
  - n-bit code = k-bit message (information bits) + (n-k)-bit check bits
  - (n, k) block code

[Figure: encoding maps a k-bit data message to an n-bit codeword and decoding maps it back; the n-bit block consists of the k-bit message followed by the (n-k)-bit check bits.]

Page 17: Linear Block Code

- Definition
  - Redundancy = 1 - log2(2^k)/n
  - Rate = k/n
  - Example: Single-bit Parity Code
    - Redundancy = 1 - log2(2^k)/n = 1 - k/n = 1 - 3/4 = 1/4
    - Rate = k/n = 3/4

[Figure: 4-bit block 1 0 1 0, made of 3-bit info (1 0 1) and one parity bit (0).]
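The two definitions can be evaluated directly for the single-bit parity example (n = 4, k = 3). The function names below are chosen for this sketch only.

```python
import math


def rate(n: int, k: int) -> float:
    """Fraction of the n-bit block that carries information."""
    return k / n


def redundancy(n: int, k: int) -> float:
    """1 - log2(2^k)/n; log2(2^k) is the information carried by 2^k codewords."""
    return 1 - math.log2(2 ** k) / n


n, k = 4, 3  # single-bit parity code: 3-bit info + 1 parity bit
code_rate = rate(n, k)
code_redundancy = redundancy(n, k)
```

Since log2(2^k) = k, redundancy is always 1 - rate, which is why the slide's two numbers add up to one.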

Page 18: Linear Block Code

- Generator (G) and Parity Check (H) Matrix
- Code Generation
  - c (codeword) = m (message) x G (Generator Matrix)
- Parity Check
  - c x H^T (Transposed Parity Check Matrix)

[Figure: encoding a k-bit data message into an n-bit codeword uses the G-matrix; decoding uses the H-matrix.]

Page 19: Linear Block Code

- Example: (4, 3) Code
  - 4-bit Codeword, 3-bit Message
    - 1-bit Parity
  - m = 101
    - Encoding: c (codeword) = m (message) x G (Generator Matrix)
      - Codeword = 1010
    - Decoding: c x H^T (Transposed Parity Check Matrix)
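The (4, 3) example can be reproduced with explicit matrices. The slide's own matrices did not survive extraction, so the `G` and `H` below are the usual systematic even-parity matrices and should be read as an assumption; the helper names are also made up for this sketch.

```python
def times_mod2(vec, matrix):
    """Multiply a row vector by a matrix over GF(2)."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix)) % 2
        for j in range(len(matrix[0]))
    ]


def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


# Assumed systematic matrices for the (4, 3) even-parity code.
G = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
H = [[1, 1, 1, 1]]

m = [1, 0, 1]
c = times_mod2(m, G)                 # encoding: c = m x G
check = times_mod2(c, transpose(H))  # parity check: c x H^T
```

With these matrices `c` is the slide's codeword 1010 and the parity check evaluates to zero, i.e. no error is flagged.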

Page 20: Linear Block Code

- Example: (7, 4) Code
  - 7-bit Codeword, 4-bit Message
    - 3-bit Parity
    - m = 0110 -> codeword = 0110011
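A small encoder illustrates the (7, 4) example. The generator matrix on the slide is not recoverable, so the parity sub-matrix `P` below is an assumption: it is one systematic Hamming-style choice picked because it maps m = 0110 to the slide's codeword 0110011.

```python
# Assumed parity sub-matrix; G = [I4 | P] is a systematic generator matrix.
P = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
G = [[1 if i == j else 0 for j in range(4)] + P[i] for i in range(4)]


def encode(message):
    """c = m x G over GF(2): 4 message bits followed by 3 parity bits."""
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(7)]


codeword = encode([0, 1, 1, 0])
```

Because the code is systematic, the first four bits of `codeword` are the message itself and the last three are the check bits.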

Page 21: Linear Block Code

- Example: (7, 4) Code
  - 7-bit Codeword, 4-bit Message
    - Single-bit Error
      - codeword = 0110011 -> 1110011 (v)
    - Single-bit Correction: 1110011 -> 0110011, m = 0110
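Single-bit correction can be sketched with syndrome decoding. The parity-check matrix `H` below is an assumption (a systematic Hamming-style matrix for which 0110011 is a valid codeword), not the matrix shown on the slide; `syndrome` and `correct_single` are names invented for the example.

```python
# Assumed parity-check matrix H = [P^T | I3] of a systematic (7, 4) code.
H = [
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]


def syndrome(word):
    """S = H x word^T over GF(2)."""
    return tuple(sum(h * b for h, b in zip(row, word)) % 2 for row in H)


def correct_single(word):
    """Flip the bit whose H column equals the syndrome, if there is one."""
    s = syndrome(word)
    fixed = list(word)
    if any(s):
        columns = [tuple(row[j] for row in H) for j in range(len(word))]
        if s in columns:
            fixed[columns.index(s)] ^= 1
    return fixed


received = [1, 1, 1, 0, 0, 1, 1]   # 0110011 with the first bit flipped
corrected = correct_single(received)
message = corrected[:4]
```

The syndrome of `received` equals the first column of `H`, which is how the decoder locates the flipped bit.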

Page 22: Linear Block Code

- Example: (7, 4) Code
  - 7-bit Codeword, 4-bit Message
    - Double-bit Error
      - codeword = 0110011 -> 1010011 (v)
    - Error Free?? -> 1010011, m = 0110
      - Miscorrection Problem
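The miscorrection problem can be made concrete with a single-error-correcting decoder. As before, the matrix `H` is an assumed systematic (7, 4) parity-check matrix rather than the slide's own, so the specific wrong output below holds only under that assumption.

```python
# Assumed parity-check matrix H = [P^T | I3] of a systematic (7, 4) code.
H = [
    [0, 1, 1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]
COLUMNS = [tuple(row[j] for row in H) for j in range(7)]


def syndrome(word):
    """S = H x word^T over GF(2)."""
    return tuple(sum(h * b for h, b in zip(row, word)) % 2 for row in H)


def decode(word):
    """Treat any non-zero syndrome as a single-bit error and 'fix' it."""
    s = syndrome(word)
    fixed = list(word)
    if any(s):
        fixed[COLUMNS.index(s)] ^= 1
    return fixed[:4]


sent_message = [0, 1, 1, 0]
received = [1, 0, 1, 0, 0, 1, 1]   # 0110011 with the first two bits flipped
decoded_message = decode(received)
miscorrected = decoded_message != sent_message
```

The two-bit error produces a syndrome that matches a third column of `H`, so the decoder flips a correct bit, lands on a different valid codeword, and reports no problem.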

Page 23: Linear Block Code

- Example: (7, 3) Code
  - 7-bit Codeword, 3-bit Message
    - Double-bit Error
      - codeword = 0110011 -> 1010011 (v)
    - Syndrome -> Non-Zero -> Double Error Detected
      - More Parity Needed!!
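Why the extra parity bit helps can be checked exhaustively. The sketch assumes a (7, 3) parity-check matrix whose columns all have odd weight (the property Hsiao codes are built on); the concrete columns in `COLUMNS` are an assumed example, not the matrix from the slide.

```python
from itertools import combinations

# Assumed columns of a 4 x 7 parity-check matrix, all of odd weight:
# three weight-3 columns for the data bits, four unit columns for check bits.
COLUMNS = [
    (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1),
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
]


def syndrome_of_error(positions):
    """Syndrome of an error pattern = XOR of the H columns at those positions."""
    s = (0, 0, 0, 0)
    for p in positions:
        s = tuple(a ^ b for a, b in zip(s, COLUMNS[p]))
    return s


double_errors = list(combinations(range(7), 2))
detected = [
    e for e in double_errors
    if any(syndrome_of_error(e)) and syndrome_of_error(e) not in COLUMNS
]
```

Every double error yields a non-zero syndrome of even weight, which can never equal an odd-weight column, so it is flagged as uncorrectable instead of being miscorrected.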

Page 24: Linear Block Code

- (n, k) Linear Block Code
  - k Data Bits -> Encoded to n-Bit Codewords
  - r Check Bits, r = n - k
  - H-Matrix (Parity Check Matrix)
    - r x n
    - C Is a Codeword IFF H·C^T = 0
  - Syndrome
    - S = H·V_error = H·(V ⊕ E) = H·V ⊕ H·E = H·E
    - No Error (E = 0), S = 0
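The syndrome identity can be written out step by step. In the derivation below the transposes are made explicit to match the codeword condition H·C^T = 0 (the slide leaves them implicit), V is the stored codeword, and E is the error vector.

```latex
\begin{aligned}
S &= H \cdot V_{error}^{T} = H \cdot (V \oplus E)^{T} \\
  &= H \cdot V^{T} \oplus H \cdot E^{T} && \text{(matrix multiplication over GF(2) is linear)} \\
  &= 0 \oplus H \cdot E^{T} && \text{($V$ is a codeword, so $H \cdot V^{T} = 0$)} \\
  &= H \cdot E^{T}
\end{aligned}
```

The syndrome therefore depends only on the error pattern, not on the stored data, and E = 0 gives S = 0.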

Page 25: Linear Block Code

- SEC-DED Codes
  - Single Error Correction & Double Error Detection
  - More than Two Bit Errors?
    - Error Detection Not Guaranteed
    - Error Miscorrection Possibility
  - Solution
    - Add More Check Bits
  - Example
    - (7, 3) SEC-DED Codes

Page 26: Linear Block Code

- Example
  - (7, 3) SEC-DED Hsiao Code
    - 7C3 = 35 Possible 3-Bit Errors
    - 28 out of 35 Are Miscorrected
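The 28-out-of-35 figure can be reproduced by enumeration. The sketch assumes a (7, 3) Hsiao-style parity-check matrix with seven distinct odd-weight 4-bit columns; the particular columns listed are an assumed example, since the slide's matrix is not recoverable. A 3-bit error counts as miscorrected when its syndrome equals some column of H, because the decoder then flips that single bit.

```python
from itertools import combinations

# Assumed odd-weight columns of a 4 x 7 Hsiao-style parity-check matrix.
COLUMNS = [
    (1, 1, 1, 0), (1, 1, 0, 1), (1, 0, 1, 1),
    (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1),
]


def syndrome_of_error(positions):
    """Syndrome of an error pattern = XOR of the H columns at those positions."""
    s = (0, 0, 0, 0)
    for p in positions:
        s = tuple(a ^ b for a, b in zip(s, COLUMNS[p]))
    return s


triple_errors = list(combinations(range(7), 3))
miscorrected = [e for e in triple_errors if syndrome_of_error(e) in COLUMNS]
detected_only = [e for e in triple_errors if syndrome_of_error(e) not in COLUMNS]
```

There are only eight odd-weight 4-bit vectors and seven of them are columns of H, so a triple error escapes miscorrection only when its syndrome happens to be the one unused vector.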

Page 27: Linear Block Code

- Example
  - (7, 3) SEC-DED Hsiao Code
    - Adding One More Row in H-Matrix -> Adding One More Check Bit
    - 12 Miscorrections out of 56 (= 8C3) 3-Bit Errors

Page 28: Linear Block Code

- SEC-DED Codes
  - Single Error Correction & Double Error Detection
  - More than Two Bit Errors?
    - Error Detection Not Guaranteed
    - Error Miscorrection Possibility
  - Solution
    - Add More Check Bits
  - Problem
    - Bigger Memory Array Required (Lower Memory Efficiency)
    - Power Consumption

Page 29: Linear Block Code

- Need to Achieve Reliable Memory Operation
- Need to Achieve Low Memory Overhead
- Reliable Memory Operation using Spare Memory Columns

Page 30: Utilization of Spare Columns for ECC

- Spare Columns / Rows
  - Reside in Memory for Repair
  - Yield Enhancement

Page 31: Utilization of Spare Columns for ECC

- Spare Columns
  - May Exist after Repair
  - WEARs: Working cElls After Repair

Page 32: Utilization of Spare Columns for ECC

- Spare Columns
  - May Exist after Repair
  - WEARs: Working cElls After Repair
- Utilize Spare Columns and WEARs
  - Store Additional Check-bits
    → Reliability Enhancement
  - Without Increasing Memory Array Size
    → Effectively Reducing Power Consumption

Page 33: Utilization of Spare Columns for ECC

- Method-1
  - Storing Repair Information in Spare Column

[Figure callout: 15 Additional Check-Bits!!]

Page 34: Utilization of Spare Columns for ECC

- Method-1
  - Size of Additional Check Bits (n-Bit Codeword, r Total Spare Columns, m Used Spare Columns)
    - Increased Codeword Size for Row without Defective Cells: n-Bit Codeword + (r - 1)-Bit Additional Check Bits
    - Increased Codeword Size for Row with Defective Cells: n-Bit Codeword + (r - m - 1)-Bit Additional Check Bits
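The Method-1 sizing rule is simple arithmetic, sketched below. The function name and the example value of 16 spare columns are assumptions for illustration; the one column subtracted is read here as the column holding repair information, as described for Method-1.

```python
def method1_extra_check_bits(total_spares: int, used_spares: int = 0) -> int:
    """Additional check bits per row under Method-1: r - m - 1.

    total_spares is r, used_spares is m (0 for a row without defective
    cells). The result is clamped at zero for rows with no free columns.
    """
    return max(total_spares - used_spares - 1, 0)


r = 16  # hypothetical number of spare columns
clean_row_bits = method1_extra_check_bits(r)        # r - 1
repaired_row_bits = method1_extra_check_bits(r, 2)  # r - m - 1 with m = 2
```

With the assumed r = 16, a defect-free row gains 15 additional check bits, which would be consistent with the "15 Additional Check-Bits" callout.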

Page 35: Utilization of Spare Columns for ECC

- Method-1
  - Architecture

[Figure: Method-1 architecture. Blocks: Reconfiguration Logic, Check Bit Generator, Memory with Spare Columns, Syndrome Generator, Correction Logic. Signals: Data Bits, ECC Enhanced Codeword, Masking Info, Error Detected.]

Additional Check Bits: # of Spare Columns - 1

Page 36: Reliability Enhancement Result

- Comparison

[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; generated syndrome not using the last two bits.]

Page 37: Reliability Enhancement Result

- Comparison

[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; generated syndrome by Method-1.]

Page 38: Reliability Enhancement Result

- Comparison
  - MTBER (Maximally Tolerable Bit Error Rate)

[Figure: MTBER comparison for 1K x 32bit and 8K x 64bit memories.]

Page 39: Utilization of Spare Columns for ECC

- Method-2
  - Overcoming Method-1 Limitation
  - Using CAM (Content-Addressable Memory)

[Figure callout: 22 Additional Check-Bits!!]

Page 40: Utilization of Spare Columns for ECC

- Method-2
  - Using CAM (Content-Addressable Memory)

[Figure: CAM array with words WORD 0 to WORD w-1; a search data register drives the searchlines with the word address, each word has a matchline, and the outputs are labeled 1-bit discard and n-bit discard.]

n: number of spare columns in memory
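A toy software model may help picture the CAM lookup. Everything in it is an assumption made for illustration: the class name `DefectCam`, the choice to store "number of spare columns used for repair" per word address, and the rule that a repaired row keeps n minus the used columns for check bits. A hardware CAM compares the search word against all entries in parallel; the dictionary lookup only mimics the result.

```python
class DefectCam:
    """Toy content-addressable memory keyed by word address (illustrative)."""

    def __init__(self, num_spare_columns: int):
        self.num_spare_columns = num_spare_columns  # n in the figure
        self._entries = {}  # word address -> spare columns used for repair

    def store(self, word_address: int, used_spares: int) -> None:
        """Record defect information for one repaired word."""
        self._entries[word_address] = used_spares

    def search(self, word_address: int):
        """Return (match, used_spares); no match means a defect-free word."""
        if word_address in self._entries:
            return True, self._entries[word_address]
        return False, 0

    def extra_check_bits(self, word_address: int) -> int:
        """Spare columns left for additional check bits (assumed rule)."""
        _, used = self.search(word_address)
        return self.num_spare_columns - used


cam = DefectCam(num_spare_columns=8)  # hypothetical n
cam.store(word_address=5, used_spares=2)
clean_bits = cam.extra_check_bits(3)     # no match: all n columns available
repaired_bits = cam.extra_check_bits(5)  # match: n minus used columns
```

Keeping defect information in a separate structure is what would free the column that Method-1 reserves, in line with the later statement that Method-2 offers as many additional check bits as there are spare columns.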

Page 41: Utilization of Spare Columns for ECC

- Method-2
  - Architecture

[Figure: Method-2 architecture. Blocks: Reconfiguration Logic, Check Bit Generator, Memory with Spare Columns, Content Addressable Memory (Defect Information), Encode, Syndrome Generator, Correction Logic. Signals: Data Bits, ECC Enhanced Codeword, Error Detected.]

Additional Check Bits: # of Spare Columns

Page 42: Reliability Enhancement Result

- Comparison

[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; generated syndrome by Method-2.]

Page 43: Reliability Enhancement Result

- Comparison
  - MTBER (Maximally Tolerable Bit Error Rate)

[Figure: MTBER comparison for 1K x 32bit and 8K x 64bit memories.]

Page 44: Reliability Enhancement Result

- Area Overhead
  - Enlarging Memory Array vs. Using Additional CAM

Page 45: Reliability Enhancement Result

- Power Overhead Estimation
  - Enlarging Memory Array vs. Using Additional CAM
    - CAM Requires Only about 1e-6 of the DRAM Power Consumption
    - No Performance Overhead

CAM - ISSCC [15]
  Capacity         128x64b
  Latency          0.96 ns
  Energy/32b       16.32 fJ

DDR3
  tCL (param.)     13.75 ns
  tRCD (param.)    13.75 ns
  Read (Energy)    18 nJ
  Write (Energy)   20 nJ
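The 1e-6 figure follows from the table values. The short calculation below assumes one 32-bit CAM search per DRAM access and compares energy per operation; the constant names are chosen for this sketch.

```python
# Values taken from the table on this slide.
CAM_ENERGY_PER_32B_J = 16.32e-15   # 16.32 fJ
DDR3_READ_ENERGY_J = 18e-9         # 18 nJ
DDR3_WRITE_ENERGY_J = 20e-9        # 20 nJ
CAM_LATENCY_NS = 0.96
DDR3_TCL_NS = 13.75

# Energy of one CAM search relative to one DRAM access (assumed 1:1 ratio).
read_ratio = CAM_ENERGY_PER_32B_J / DDR3_READ_ENERGY_J
write_ratio = CAM_ENERGY_PER_32B_J / DDR3_WRITE_ENERGY_J

# The CAM lookup is far shorter than the DRAM column access latency.
lookup_fits_in_tcl = CAM_LATENCY_NS < DDR3_TCL_NS
```

Both ratios come out slightly below 1e-6, and the 0.96 ns lookup fits well inside the 13.75 ns tCL, which matches the "No Performance Overhead" claim if the lookup overlaps with the DRAM access.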

Page 46: Q&A

Thank you