ECC for Reliable DRAM Operation using Spare Columns
Joon-Sung Yang( [email protected] )
Department of Semiconductor Systems Engineering, Sungkyunkwan University
2018 Test Technology Workshop (2018.10.23)
DATES Lab
Ø Era of Machine Learning / Big Data / IoT
  § Convergence
    q Computing, Communication Networks, Data Storage
Page 2
Low Power Operation
Page 3
Data Centric
Page 4
Data Centric → Data Ecosystem
Data Initiation
  § Personal Devices
  § Sensor Nodes
Data Transportation
  § Connected Devices
  § Network
Big Data Processing
  § Data Center, Servers
  § Connected Information
Page 5
Operating Condition
Extreme Conditions
  § Temp, Humidity, Vibration
  § Battery Powered – Low Power
Connected Network
  § Fast Data Handling
  § Between Devices
High Performance Computing
  § Extreme Data Volume
  § Time Critical
  § High Reliability
Base of System : H/W Platform
  ↓
RELIABILITY
Ø Data Centric Operation
  § Data Processing Focused Operation
  § Need for Reliable and Energy Efficient Memory Operation
Page 6
High Performance Computing
Ø Logic
  § Time / Space Multiplexing
    q Extreme Reliability
  § Rollback
    q Ex) Razor
  § Worst Case Vector
    q Exercising the Weakest Area
Page 7
Reliability / Dependable Computing
Ø Memory
  § Scrubbing
  § ECC (Error Correcting Code)
    q Various Codes Available
    q Most Common
    q Message + Check-bit = Codeword
Page 8
Reliability / Dependable Computing
Ø ECC
  § Chipkill by IBM
    q Similar to RAID for Disk Subsystems
    q When Writing Data to a DIMM, a Duplicated Set of Data (as Checksum) Is Written to Another Part of the Memory Subsystem
Page 9
Reliable Memory Operation
Source : Dell Technology Brief
Ø ECC
  § Chipkill by IBM
    q On Memory Failure, Data Is Recovered by Re-calculation using Checksum Information
    q Corrects 1-, 2-, 3- and 4-Bit Errors
    q Can Be Used in Mission-Critical Applications
Page 10
Reliable Memory Operation
Used in NASA Mars Pathfinder Probe
Source : http://www-05.ibm.com/hu/termekismertetok/xseries/dn/chipkill.pdf
Ø Dynamic Refresh Rate Control
  § RAIDR (Retention-Aware Intelligent DRAM Refresh)
    q Increasing Refresh Rate for Weak Rows
    q Decreasing Refresh Rate for Strong Rows
    q 16.1% Power Reduction (32GB DRAM)
    q Cons : Does Not Recover Transient Errors, Profiling Overhead
Page 11
Reliable Memory Operation
Source : RAIDR : Retention-Aware Intelligent DRAM Refresh by Jamie Liu (CMU) ISCA’12
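The refresh-rate idea above can be sketched as a lookup from a row's profiled retention time to a refresh interval. This is only an illustration: the bin values, the function name `refresh_interval_ms`, and the sample retention times are assumptions, not taken from the slide.

```python
# Hypothetical retention-aware refresh binning (illustrative values only).
REFRESH_BINS_MS = (64, 128, 256)  # assumed allowed refresh intervals, shortest first


def refresh_interval_ms(retention_ms: float) -> int:
    """Pick the longest allowed interval that does not exceed the profiled retention time."""
    chosen = REFRESH_BINS_MS[0]  # weakest rows fall back to the shortest interval
    for interval in REFRESH_BINS_MS:
        if interval <= retention_ms:
            chosen = interval
    return chosen


# Row index -> profiled retention time in ms (made-up numbers).
profiled_rows = {0: 70.0, 1: 300.0, 2: 150.0}
schedule = {row: refresh_interval_ms(t) for row, t in profiled_rows.items()}
print(schedule)
```

Weak rows keep the short interval while strong rows are refreshed less often, which is where the power saving comes from. As the slide notes, the retention profile has to be measured first and the scheme does nothing for transient errors.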
Ø Ability of a System to Continue Error-free Operation in the Presence of Unexpected Faults
Ø Fault
  § Permanent Fault
    q Hard Fault / Hard Error
    q Result from Manufacturing Defects
    q Early Life Failures
  § Temporary Fault
    q Soft Error
      – Present for a Short Time
    q Transient (non-recurring) Error (Noise, Power)
    q Intermittent (recurring) Error (Timing)
Page 12
Fault Tolerance
Ø Reliability vs. Power
  § Reliability
    q Add Check-bits
      – Area Overhead
      – High Power Consumption
    q Increase Vdd
    q Increase Refresh Rate
  § Reliability Trades Off Against Power (Higher Reliability Costs More Power)
Page 13
Reliable Memory Operation
Ø Technology Scaling
  § DRAM
    q Higher Soft Error Rate
    q Single Bit Upset -> Multiple Bit Upset
    q Need for Error Correction
Page 14
Memory Reliability
Ø Linear Code
  § Any Linear Combination (XOR) of Codewords -> Codeword
Ø Block Code
  § Same Length of Codewords
  § Block Size based Decoding
Ø Linear Block Code
  § Any Linear Combination (XOR) of Any Two Blocks -> Another Block
Page 15
Linear Block Code
Ø Separable Codeword
  § n-bit code = k-bit message (information bits) + (n-k)-bit check bits
  § (n, k) block code
Page 16
Linear Block Code
[Figure: a k-bit data message is encoded into an n-bit codeword and decoded back; the n-bit block consists of the k-bit message plus the (n-k)-bit check bits]
Ø Definition
  § Redundancy = 1 - log2(2^k)/n
  § Rate = k/n
  § Example : Single-bit Parity Code
    q Redundancy = 1 - log2(2^k)/n = 1 - k/n = 1 - 3/4 = 1/4
    q Rate = k/n = 3/4
Page 17
Linear Block Code
[Figure: 4-bit block 1 0 1 0 = 3-bit info (1 0 1) + parity bit (0)]
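The definitions above can be checked numerically for the single-bit parity example. This is a minimal sketch; the function names are illustrative and even parity is assumed, which is consistent with the 1 0 1 0 block.

```python
from fractions import Fraction


def code_rate(n: int, k: int) -> Fraction:
    """Rate of an (n, k) block code: k message bits per n codeword bits."""
    return Fraction(k, n)


def redundancy(n: int, k: int) -> Fraction:
    """Redundancy 1 - log2(2^k)/n, which simplifies to 1 - k/n."""
    return 1 - Fraction(k, n)


def add_even_parity(info_bits):
    """Append one parity bit so the whole block has an even number of 1s (assumed convention)."""
    return info_bits + [sum(info_bits) % 2]


block = add_even_parity([1, 0, 1])
print(block, code_rate(4, 3), redundancy(4, 3))
```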
Ø Generator (G) and Parity Check (H) Matrix
Ø Code Generation
  § c (codeword) = m (message) x G (Generator Matrix)
Ø Parity Check
  § c x H^T (Transposed Parity Check Matrix) = 0 for a Valid Codeword
Page 18
Linear Block Code
[Figure: k-bit data message -> encoding with G-Matrix -> n-bit codeword -> decoding with H-Matrix]
Ø Example : (4, 3) Code
  § 4-bit Codeword, 3-bit Message
    q 1-bit Parity
  § m = 101
    q Encoding : c (codeword) = m (message) x G (Generator Matrix)
      – Codeword = 1010
    q Decoding : c x H^T (Transposed Parity Check Matrix)
Page 19
Linear Block Code
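The (4, 3) example can be reproduced with explicit matrices. The slide's matrices did not survive extraction, so the ones below are an assumption: the usual even-parity choice G = [I3 | 1] and H = [1 1 1 1], with `mat_vec_gf2` as an illustrative helper name.

```python
def mat_vec_gf2(vec, matrix):
    """Multiply a row vector by a matrix over GF(2), i.e. with XOR as addition."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) % 2 for j in range(cols)]


# Assumed generator matrix G = [I3 | 1]: the last column computes even parity.
G = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
# Transpose of the assumed parity check matrix H = [1 1 1 1].
H_T = [[1], [1], [1], [1]]

m = [1, 0, 1]
c = mat_vec_gf2(m, G)           # encoding: c = m x G
check = mat_vec_gf2(c, H_T)     # decoding check: c x H^T
corrupted = mat_vec_gf2([1, 1, 1, 0], H_T)  # one flipped bit gives a non-zero check
print(c, check, corrupted)
```

A single parity bit only detects an odd number of bit errors. It cannot locate the flipped bit, which is why the following examples add more check bits.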
Ø Example : (7, 4) Code
  § 7-bit Codeword, 4-bit Message
    q 3-bit Parity
    q m = 0110 -> codeword = 0110011
Page 20
Linear Block Code
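The generator matrix used on this slide is not recoverable from the text. As an assumption, one systematic choice G = [I4 | P] that reproduces the stated codeword is shown below; with it, the encoding of m = 0110 is the XOR of the second and third rows of G.

```latex
G = \left[\begin{array}{cccc|ccc}
1 & 0 & 0 & 0 & 0 & 1 & 1\\
0 & 1 & 0 & 0 & 1 & 0 & 1\\
0 & 0 & 1 & 0 & 1 & 1 & 0\\
0 & 0 & 0 & 1 & 1 & 1 & 1
\end{array}\right],
\qquad
c = mG = (0,1,1,0)\,G = 0100101 \oplus 0010110 = 0110011 .
```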
Ø Example : (7, 4) Code
  § 7-bit Codeword, 4-bit Message
    q Single-bit Error
      – codeword = 0110011 -> 1110011 (v)
    q Single-bit Correction : 1110011 -> 0110011, m = 0110
Page 21
Linear Block Code
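The single-bit correction above can be sketched with syndrome decoding. The parity part `P` is an assumption chosen so that m = 0110 encodes to 0110011 as on the slide; the actual matrix used in the talk may differ.

```python
# Assumed parity part of G = [I4 | P]; H = [P^T | I3].
P = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1]]

# Column i of H is the syndrome produced by a single error at bit position i.
H_COLUMNS = [tuple(row) for row in P] + [(1, 0, 0), (0, 1, 0), (0, 0, 1)]


def encode(m):
    """Systematic encoding: message bits followed by three parity bits."""
    parity = [sum(m[i] * P[i][j] for i in range(4)) % 2 for j in range(3)]
    return m + parity


def syndrome(v):
    """XOR together the H columns selected by the 1 bits of v."""
    s = (0, 0, 0)
    for bit, col in zip(v, H_COLUMNS):
        if bit:
            s = tuple(a ^ b for a, b in zip(s, col))
    return s


def correct_single(v):
    """Flip the bit whose H column equals the syndrome (assumes at most one error)."""
    s = syndrome(v)
    fixed = list(v)
    if s != (0, 0, 0):
        fixed[H_COLUMNS.index(s)] ^= 1
    return fixed


c = encode([0, 1, 1, 0])
v = [1, 1, 1, 0, 0, 1, 1]       # first bit flipped
fixed = correct_single(v)
print(c, syndrome(v), fixed)
```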
Ø Example : (7, 4) Code
  § 7-bit Codeword, 4-bit Message
    q Double-bit Error
      – codeword = 0110011 -> 1010011 (v)
    q Error Free?? -> 1010011 m = 0110
      – Miscorrection Problem
Page 22
Linear Block Code
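The miscorrection problem can be made concrete with a small sketch. It assumes a parity check matrix whose columns are listed in `H_COLS` (chosen to be consistent with the codeword 0110011); with a different H the wrongly flipped bit would differ, but a single-error-correcting (7, 4) code always maps a double error to some single-bit "fix".

```python
from functools import reduce

# Assumed H columns as 3-bit integers, one per codeword bit position.
H_COLS = [0b011, 0b101, 0b110, 0b111, 0b100, 0b010, 0b001]


def syndrome(bits):
    """XOR of the H columns at positions where the received bit is 1."""
    return reduce(lambda s, col: s ^ col, (c for b, c in zip(bits, H_COLS) if b), 0)


def sec_decode(bits):
    """Single-error-correcting decoder: flips the bit matching a non-zero syndrome."""
    s = syndrome(bits)
    out = list(bits)
    if s:
        out[H_COLS.index(s)] ^= 1
    return out


sent = [0, 1, 1, 0, 0, 1, 1]
received = [1, 0, 1, 0, 0, 1, 1]   # first two bits flipped
decoded = sec_decode(received)
print(syndrome(received), decoded, decoded[:4])
```

The syndrome is non-zero, so the decoder knows something is wrong, but it matches the column of a third bit. The decoder flips that bit and returns a valid codeword for the wrong message.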
Ø Example : (7, 3) Code
  § 7-bit Codeword, 3-bit Message
    q Double-bit Error
      – codeword = 0110011 -> 1010011 (v)
    q Syndrome -> Non-Zero -> Double Error Detected
      – More Parity Needed!!
Page 23
Linear Block Code
Ø (n, k) Linear Block Code
  § k Data Bits -> Encoded to n-Bit Codewords
  § r Check Bits, r = n - k
  § H-Matrix (Parity Check Matrix)
    q r x n
    q C Is a Codeword IFF H·C^T = 0
  § Syndrome
    q S = H·V_error^T = H·(V⊕E)^T = H·V^T ⊕ H·E^T = H·E^T
    q No Error (E = 0), S = 0
Page 24
Linear Block Code
Ø SEC-DED Codes
  § Single Error Correction & Double Error Detection
  § More than Two Bit Errors?
    q Error Detection Not Guaranteed
    q Error Miscorrection Possibility
  § Solution
    q Add More Check Bits
  § Example
    q (7, 3) SEC-DED Codes
Page 25
Linear Block Code
Ø Example
  § (7, 3) SEC-DED Hsiao Code
    q 7C3 (= 35) Possible 3-Bit Errors
    q 28 out of 35 Miscorrected
Page 26
Linear Block Code
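The 28-out-of-35 figure can be reproduced by brute force. The H-matrix on the slide is not recoverable, so the columns below are an assumption: a (7, 3) Hsiao-style code built from odd-weight 4-bit columns (three weight-3 data columns plus the identity). A 3-bit error counts as miscorrected when its syndrome equals some column of H, because the decoder then treats it as a single-bit error.

```python
from itertools import combinations

# Assumed H columns (4 check bits), all of odd weight as in a Hsiao code.
H_COLS = [0b1110, 0b1101, 0b1011, 0b1000, 0b0100, 0b0010, 0b0001]


def classify_triple_errors(cols):
    """Count 3-bit error patterns that look like a single-bit error vs. those detected."""
    miscorrected = detected = 0
    for triple in combinations(cols, 3):
        s = triple[0] ^ triple[1] ^ triple[2]   # syndrome of the 3-bit error
        if s in cols:
            miscorrected += 1
        else:
            detected += 1
    return miscorrected, detected


miscorrected, detected = classify_triple_errors(H_COLS)
print(miscorrected, detected, miscorrected + detected)
```

Only the patterns whose syndrome lands on the one unused odd-weight vector are flagged as uncorrectable, which is why so few triple errors are caught with four check bits.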
Ø Example
  § (7, 3) SEC-DED Hsiao Code
    q Adding One More Row in H-Matrix -> Adding One More Check Bit
    q 12 Miscorrections out of 56 (= 8C3) 3-Bit Errors
Page 27
Linear Block Code
Ø SEC-DED Codes
  § Single Error Correction & Double Error Detection
  § More than Two Bit Errors?
    q Error Detection Not Guaranteed
    q Error Miscorrection Possibility
  § Solution
    q Add More Check Bits
  § Problem
    q Bigger Memory Array Required (Lower Memory Efficiency)
    q Power Consumption
Page 28
Linear Block Code
Ø Need to Achieve Reliable Memory Operation
Ø Need to Achieve Low Memory Overhead
Ø Reliable Memory Operation using Spare Memory Columns
Page 29
Linear Block Code
Ø Spare Columns / Rows
  § Reside in Memory for Repair
  § Yield Enhancement
Page 30
Utilization of Spare Columns for ECC
Ø Spare Columns
  § May Exist after Repair
  § WEARs : Working cElls After Repair
Ø Utilize Spare Columns and WEARs
  § Store Additional Check-bits
    → Reliability Enhancement
  § Without Increasing Memory Array Size
    → Effectively Reducing Power Consumption
Page 32
Utilization of Spare Columns for ECC
Ø Method-1
  § Storing Repair Information in Spare Column
Page 33
Utilization of Spare Columns for ECC
15 Additional Check-Bits!!
Ø Method-1
  § Size of Additional Check Bits (n-Bit Codeword, r Total Spare Columns, m Used Spare Columns)
    q Increased Codeword Size for Row without Defective Cells
      – n-Bit Codeword + (r - 1)-Bit Additional Check Bits
    q Increased Codeword Size for Row with Defective Cells
      – n-Bit Codeword + (r - m - 1)-Bit Additional Check Bits
Page 34
Utilization of Spare Columns for ECC
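The Method-1 sizing rule can be written as a small helper. This is a sketch under assumptions: the function names are illustrative, the -1 is read as the one spare column that Method-1 reserves for repair information, and r = 16 is a hypothetical value picked only because it yields the 15 additional check bits quoted earlier.

```python
def method1_extra_check_bits(total_spares: int, used_spares: int = 0) -> int:
    """Additional check bits available in one row: r - m - 1 (r - 1 when no spare is used)."""
    if not 0 <= used_spares <= total_spares:
        raise ValueError("used_spares must be between 0 and total_spares")
    return max(total_spares - used_spares - 1, 0)


def method1_codeword_bits(n: int, total_spares: int, used_spares: int = 0) -> int:
    """Enlarged codeword size: the original n bits plus the additional check bits."""
    return n + method1_extra_check_bits(total_spares, used_spares)


# Hypothetical numbers: 16 spare columns, a (39, 32) base code.
print(method1_extra_check_bits(16))       # row without defective cells
print(method1_extra_check_bits(16, 2))    # row that used 2 spares for repair
print(method1_codeword_bits(39, 16, 2))
```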
Ø Method-1
  § Architecture
Page 35
Utilization of Spare Columns for ECC
[Figure: Method-1 architecture. Blocks: Reconfiguration Logic, Check Bit Generator, Memory with Spare Columns, Syndrome Generator, Correction Logic. Labels: Data Bits, ECC Enhanced Codeword, Masking Info, Error Detected. Additional Check Bits : # of Spare Columns - 1]
Ø Comparison
Page 36
Reliability Enhancement Result
[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; syndrome generated without using the last two bits]
Ø Comparison
Page 37
Reliability Enhancement Result
[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; syndrome generated by Method-1]
Ø Comparison
  § MTBER (Maximally Tolerable Bit Error Rate)
Page 38
Reliability Enhancement Result
[Figure: MTBER comparison; panels: 1K x 32bit, 8K x 64bit]
Ø Method-2
  § Overcoming Method-1 Limitation
  § Using CAM (Content-Addressable Memory)
Page 39
Utilization of Spare Columns for ECC
22 Additional Check-Bits!!
Ø Method-2
  § Using CAM (Content-Addressable Memory)
Page 40
Utilization of Spare Columns for ECC
[Figure: CAM organization. Words WORD 0 to WORD w-1; a search data register drives the searchlines; matchlines are encoded into the Word Address; 1-bit discard and n-bit discard paths; n: number of spare columns in memory]
Ø Method-2
  § Architecture
Page 41
Utilization of Spare Columns for ECC
[Figure: Method-2 architecture. Blocks: Reconfiguration Logic, Check Bit Generator, Memory with Spare Columns, Content Addressable Memory (Defect Information), Encode, Syndrome Generator, Correction Logic. Labels: Data Bits, ECC Enhanced Codeword, Error Detected. Additional Check Bits : # of Spare Columns]
Ø Comparison
Page 42
Reliability Enhancement Result
[Figure: codewords before repair and after repair (logical), fed to the syndrome generator; syndrome generated by Method-2]
Ø Comparison
  § MTBER (Maximally Tolerable Bit Error Rate)
Page 43
Reliability Enhancement Result
[Figure: MTBER comparison; panels: 1K x 32bit, 8K x 64bit]
Ø Area Overhead
  § Enlarging Memory Array vs. Using Additional CAM
Page 44
Reliability Enhancement Result
Ø Power Overhead Estimation
  § Enlarging Memory Array vs. Using Additional CAM
    q CAM Requires Roughly 1e-6 of the DRAM Power Consumption
    q No Performance Overhead
Page 45
Reliability Enhancement Result
CAM - ISSCC [15]
  Capacity         128 x 64b
  Latency          0.96 ns
  Energy/32b       16.32 fJ
DDR3
  tCL (param.)     13.75 ns
  tRCD (param.)    13.75 ns
  Read (Energy)    18 nJ
  Write (Energy)   20 nJ
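The 1e-6 figure is consistent with the table if it is read as the ratio of CAM search energy to DDR3 access energy. That reading is an assumption, and the comparison below is energy per access rather than measured power.

```python
# Values from the table above, converted to joules and nanoseconds.
CAM_ENERGY_PER_32B_J = 16.32e-15   # 16.32 fJ
DDR3_READ_ENERGY_J = 18e-9         # 18 nJ
DDR3_WRITE_ENERGY_J = 20e-9        # 20 nJ
CAM_LATENCY_NS = 0.96
DDR3_TCL_NS = 13.75

read_ratio = CAM_ENERGY_PER_32B_J / DDR3_READ_ENERGY_J
write_ratio = CAM_ENERGY_PER_32B_J / DDR3_WRITE_ENERGY_J
latency_hidden = CAM_LATENCY_NS < DDR3_TCL_NS  # CAM lookup is shorter than tCL

print(f"CAM/read energy ratio:  {read_ratio:.2e}")
print(f"CAM/write energy ratio: {write_ratio:.2e}")
print(f"CAM lookup shorter than tCL: {latency_hidden}")
```

Both ratios come out just under 1e-6, and the CAM latency is well below tCL, which matches the "No Performance Overhead" bullet if the lookup overlaps with the DRAM access.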
Ø Q&A?
Page 46
Thank you