Cost-Optimization and Chip Implementation of On-Chip Network
Se-Joong Lee (이세중)
May 17, 2005
Semiconductor System Lab, Dept. of EE, KAIST
Doctoral Thesis Presentation
Ph.D. Thesis Presentation, Se-Joong Lee 2
Outline
Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions
Introduction
System-on-Chip design
  Complexity of interconnect
    Interconnects more than 10 processors
    Requires a few GB/s of on-chip traffic
    Communication channel is the bottleneck
  Uncertain PHY parameters
    < 100nm era
    Prone to process variations
  Other issues
    Clock synchronization @ ~GHz
On-Chip Network
[Figure: CELL processor; device cross-section labeled Oxide Spacer, FG, CG, W (60nm; 200 nm scale)]
Introduction (Cont’d)
OCN concept
Basic architecture definition
Architecture optimization
Advanced featuring
Chip implementation
Previous works
Contribution of this work
Background
Research Issues on OCN
Network topology
Protocol / packet format
Routing scheme
Switching scheme
Flow control
[Figure: network of processing units (nodes), switches, and queuing buffers exchanging packets]
Background (Cont’d)
Active working groups & results
  Stanford – Bologna
    Netchip project = SUNMAP + XpipesCompiler
    Topology configuration, automated SystemC model synthesis
  Royal Institute of Technology
    Nostrum project
    Protocol, application-to-NoC mapping
  Philips
    AEthereal project
    QoS-guaranteed OCN
  Princeton
    Power modeling and simulation
Limitations of the previous works
  → Lack of consideration of implementation issues
Background (Cont’d)
Chip implementation cases
  Pleiades
    UC Berkeley, ISSCC 2000
    Statically configured mesh-like interconnect
    No concept of network
  Programmable Switch
    ST Microelectronics, ISSCC 2003
    Programmable switch using Flash memory structure
  RAW
    MIT, ISSCC 2003
    Mesh network for 16 processors
    No dedicated research on the network architecture
[Figures: MIT's RAW; Berkeley's Pleiades; STM's On-Chip Switch]
Research Topics in This Work
1. Network topology
2. Channel width
3. Packet format/protocol
4. Latency
5. Serdes
6. Low-power link
7. Mesochronous communication
Introduction
Background
Cost-Optimized OCN Architecture
  [1] Topology Selection
  [2] On-Chip Serialization
  [3] Packet format/protocol
  [4] Latency Reduction
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusion
Topology Selection
General ideas on topology selection
  Traffic pattern – Topology
    Uniformly distributed traffic → Mesh-based topology
    Localized traffic → Hierarchical topology (e.g., Tree, Hier.-Mesh)
  Topology selection in a level
    Mesh vs. Star according to
      Link length
      # of nodes
[Figure: Tree and Hierarchical Star topologies; hierarchical star with clusters, local networks, and a global network, asking "Mesh or Star?" at each level]
[1]
Topology Cost Comparison
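Area equations (approximated)
\[ \mathrm{area(Mesh)} = A_M = (25N - 36\sqrt{N}) \cdot A_{SW} + (5N - 4\sqrt{N}) \cdot A_{QB} + 4(N - \sqrt{N}) \cdot A_{LK} \]
\[ \mathrm{area(Star)} = A_S = N^2 \cdot A_{SW} + N \cdot A_{QB} + N\sqrt{N} \cdot A_{LK} \]
The three terms of each sum are labeled $A_{M\_SW}$, $A_{M\_QB}$, $A_{M\_LK}$ (mesh) and $A_{S\_SW}$, $A_{S\_QB}$, $A_{S\_LK}$ (star).
\[ N \le 16 \;\rightarrow\; A_{S\_SW} \le A_{M\_SW} \]
\[ N \le 16 \text{ and } A_{LK} \le 3 \cdot A_{QB} \;\rightarrow\; (A_{S\_LK} + A_{S\_QB}) \le (A_{M\_LK} + A_{M\_QB}) \]
Rules of thumb
\[ N \le 16 \text{ and } A_{LK} \le 3 \cdot A_{QB} \;\rightarrow\; A_S \le A_M \]
\[ N \le 16 \text{ and } E_{LK} \le 2 \cdot E_{QB} \;\rightarrow\; E_S \le E_M \]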
[1]
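The area comparison can be checked numerically. The sketch below is not from the thesis. It assumes the approximated equations take the form A_M = (25N - 36√N)·A_SW + (5N - 4√N)·A_QB + 4(N - √N)·A_LK and A_S = N²·A_SW + N·A_QB + N√N·A_LK. The function names and the unit areas passed in are hypothetical.

```python
import math

def mesh_area(n, a_sw, a_qb, a_lk):
    """Approximated mesh area for n nodes (assumed form of the slide equation)."""
    r = math.sqrt(n)
    return (25 * n - 36 * r) * a_sw + (5 * n - 4 * r) * a_qb + 4 * (n - r) * a_lk

def star_area(n, a_sw, a_qb, a_lk):
    """Approximated star area for n nodes (assumed form of the slide equation)."""
    return n ** 2 * a_sw + n * a_qb + n * math.sqrt(n) * a_lk

def star_is_smaller(n, a_sw, a_qb, a_lk):
    """True when the star topology costs no more area than the mesh."""
    return star_area(n, a_sw, a_qb, a_lk) <= mesh_area(n, a_sw, a_qb, a_lk)
```

With A_LK = 3·A_QB the two areas meet at N = 16, which is where the rule of thumb places its boundary; for smaller N the star is cheaper, and for N = 25 the mesh is.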
Topology Cost Comparison (Cont’d)
Numerical analysis
  Based on actual implementation results
[Figure: ratios AS / AM and ES / EM versus N (4, 9, 16, 25) for L = 0.5, 1.0, 1.5, 2.0 mm]
[1]
Phit Size Determination
Wide vs. narrow phit size
  Wide phit size
    Realizes high bandwidth easily
    Not friendly in terms of P&R
    Possibly high energy consumption due to large building blocks
  Small phit size (serialized link)
    Realizes a small OCN footprint
    BW limitation
    Possibly high energy consumption due to serdes and high clock frequency
No previous work on phit size determination
Optimal phit size determination is required
On-Chip Serialization (OCS)
Effect on links
  Link area (ALK) is reduced
  ELK = EWIR + EDRV
    Wider spacing → Coupling cap. reduction → EWIR ↓
    Higher frequency → Smaller tapering factor → EDRV ↑
    EWIR > EDRV, therefore ELK ↓
[Figure: link energy ratio versus RSER (1, 2, 4, 8) with OCS]
[2]
On-Chip Serialization (OCS) (Cont’d)
Effects on links
  Switching activity
    No encoding: up to 3x for a linearly increasing data pattern
    Encoded serial link: <10% overhead
[2]
[Figure: parallel/serial bus transitions for memory accesses of 3D operations (INST_ADDR, INST_DATA, DATA_ADDR, DATA_DATA); series: parallel bus, serial bus, encoded serial]
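To see why serialization can raise switching activity, the following sketch counts signal transitions for the same word stream on a parallel bus and on a serial wire. It is a simplified illustration, not the measurement method of the thesis. It assumes full serialization onto a single wire, LSB first, with no encoding, and the function names are hypothetical.

```python
def parallel_transitions(words):
    """Count bit flips between consecutive words on a parallel bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def serial_transitions(words, width):
    """Count flips on one wire when each word is sent LSB first."""
    bits = [(w >> i) & 1 for w in words for i in range(width)]
    return sum(a != b for a, b in zip(bits, bits[1:]))
```

For example, repeating the word 0b0101 causes no transition on a 4-bit parallel bus but seven transitions on the serial wire, because bits that sat on separate wires now alternate on the same one.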
On-Chip Serialization (OCS) (Cont’d)
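Effects on switches
  ESF + EARB
    Arbiter operates on a packet basis → OCS does not affect it
    Switch fabric area/energy is reduced
[Figure: switch fabric carrying W-bit phits, and W/RSER-bit phits after OCS, with capacitances CC0, CC1, ... and CJ0, CJ1, ...]
\[ C_J = \sum_i C_{Ji}, \qquad C_C = \sum_i C_{Ci} \]
Without OCS:
\[ E_{SF} = W \times (C_C + C_J) \]
With OCS:
\[ E_{SF} = R_{SER} \times \frac{W}{R_{SER}} \times \left( \frac{C_C}{R_{SER}} + C_J \right) = W \times \left( \frac{C_C}{R_{SER}} + C_J \right) \]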
[2]
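A small numerical sketch of the switch-fabric energy relation follows. It assumes the serialized form E_SF = RSER × (W / RSER) × (CC / RSER + CJ). The 0.92 / 0.08 split between CC and CJ is an illustrative assumption, not a value from the thesis; it was picked because it gives ratios close to the SF row of the energy transfer matrix shown later (1.00, 0.54, 0.32, 0.20).

```python
def sf_energy(width, c_c, c_j, r_ser=1):
    """Switch-fabric energy: R_SER * (W / R_SER) * (C_C / R_SER + C_J)."""
    return r_ser * (width / r_ser) * (c_c / r_ser + c_j)

def sf_energy_ratio(r_ser, c_c=0.92, c_j=0.08, width=88):
    """Energy with r_ser:1 serialization relative to the unserialized fabric."""
    return sf_energy(width, c_c, c_j, r_ser) / sf_energy(width, c_c, c_j, 1)
```

Only the CC term shrinks with RSER, so the saving saturates once CJ dominates.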
On-Chip Serialization (OCS) (Cont’d)
Effects on queuing buffers
  Bitline capacitance in an SRAM-type QB
[Figure: dual-port SRAM queuing buffer (PRE, WD, WLD, BLSA, ADDR GEN) for RSER = 1, RSER = 2, and RSER = 4, 8, 16, with array dimensions labeled 88/8, 44/16, and 22/32]
[Figure: QB energy ratio versus RSER (1, 2, 4, 8)]
[2]
Transfer & Characteristic Matrix
Transfer matrix
  Transformation representing OCS operation
Characteristic matrix
  Matrix representing area/energy breakdown of an OCN

TME =
          1:1    2:1    4:1    8:1
  Link    1.00   0.75   0.64   0.80
  QB      1.00   1.13   1.30   1.60
  SF      1.00   0.54   0.32   0.20
  Other   1.00   1.00   1.00   1.00

TMA =
          1:1    2:1    4:1    8:1
  Link    1.00   0.75   0.5    0.31
  QB      1.00   1.00   1.00   1.00
  SF      1.00   0.25   0.06   0.02
  Other   1.00   1.00   1.00   1.00

CMME =
          Link    QB     SF     Other
  N= 4    0.48L   0.32   0.50   0.04
  N= 9    0.85L   0.56   0.87   0.06
  N=16    1.19L   0.77   1.20   0.07
  N=25    1.53L   0.97   1.52   0.08

CMSE =
          Link    QB     SF     Other
  N= 4    0.72L   0.27   0.33   0.02
  N= 9    1.27L   0.27   0.75   0.04
  N=16    1.79L   0.27   1.33   0.06
  N=25    2.29L   0.27   2.08   0.10

CMMA =
          Link    QB     SF     Other
  N= 4    1.60L   0.99   0.60   0.03
  N= 9    2.13L   1.21   0.92   0.03
  N=16    2.40L   1.32   1.09   0.03
  N=25    2.56L   1.39   1.20   0.04

CMSA =
          Link    QB     SF     Other
  N= 4    1.60L   0.33   0.26   0.01
  N= 9    2.13L   0.33   0.58   0.01
  N=16    3.20L   0.33   1.03   0.01
  N=25    3.84L   0.33   1.61   0.01

CM' = CM × TM
[2]
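The product CM' = CM × TM can be evaluated directly to find the serialization ratio with the lowest total cost. The sketch below does this for mesh energy with L = 1 mm. Reading the matrix whose QB column grows with N as the mesh-energy characteristic matrix is an assumption, and the helper names are hypothetical.

```python
R_SER = [1, 2, 4, 8]

# Energy transfer matrix: rows Link, QB, SF, Other; columns 1:1, 2:1, 4:1, 8:1.
TM_E = [
    [1.00, 0.75, 0.64, 0.80],
    [1.00, 1.13, 1.30, 1.60],
    [1.00, 0.54, 0.32, 0.20],
    [1.00, 1.00, 1.00, 1.00],
]

# Assumed mesh-energy characteristic matrix with L = 1 mm:
# rows N = 4, 9, 16, 25; columns Link, QB, SF, Other.
CM_ME = [
    [0.48, 0.32, 0.50, 0.04],
    [0.85, 0.56, 0.87, 0.06],
    [1.19, 0.77, 1.20, 0.07],
    [1.53, 0.97, 1.52, 0.08],
]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def optimal_r_ser(cm, tm):
    """For each N (row of CM'), return the R_SER whose total cost is lowest."""
    return [R_SER[row.index(min(row))] for row in matmul(cm, tm)]
```

Each row of CM' holds the total for one N across the four serialization ratios, so the minimum of a row is the cost-optimal ratio for that network size.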
Optimal RSER Determination
Analyze AM, AS, EM, ES according to N (4~25)
  1:1 ~ 8:1 serialization
  Default phit size = 88b
  Distance b/w PUs = 1mm
  QB = 3-packet capacity
Optimal RSER = 4:1 OCS (or 22b phit size)
[Figure: four panels versus RSER (1, 2, 4, 8): star energy ES and mesh energy EM (x10^-10 J), star area AS and mesh area AM (x10^5 μm^2); series: Link, QB, SF, Others breakdown and N = 4, 9, 16, 25]
[2]
OCS-applied Topology Comparison
[Figure: ratios AS / AM and ES / EM versus N (4, 9, 16, 25) for L = 0.5, 1.0, 1.5, 2.0 mm, shown before and after applying OCS; annotations: Area Reduction, Energy Reduction]
[2]
Packet Format & Protocol
Typical vs. proposed packet format
  Typical format
    Layered architecture
    Phit, flit, and packet are defined independently
  Proposed format
    Aligned architecture
    Phit and packet formats are tightly correlated
    → Low decoding overhead (50% power reduction)
    → Easy to scale sub-fields of a packet
[Figure: typical format: a packet split into Flit 1, Flit 2, Flit 3, ..., Flit n, each with a flit header]
[Figure: proposed format: 88b packet = SW header (16b), NI header (4b), PL0 (4b), PL1 (32b), PL2 (32b); link fields link<3:0> (4b), link<4> (1b), link<5> (1b), link<13:6> (8b), link<21:14> (8b)]
[3]
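One way to read the aligned format is that every 22b phit carries a fixed slice of every packet field, so a receiver can decode each field from fixed link positions. The sketch below packs an 88b packet into four 22b phits under that reading. The mapping of fields to link bits, the slice order (least significant slice first), and the function names are assumptions made for illustration.

```python
N_PHITS = 4  # 88b packet / 22b phit

def pack_phits(sw, ni, pl0, pl1, pl2):
    """Split SW header (16b), NI header (4b), PL0 (4b), PL1/PL2 (32b) into 22b phits."""
    phits = []
    for k in range(N_PHITS):
        phit = (sw >> (4 * k)) & 0xF                   # link<3:0>:   4b of SW header
        phit |= ((ni >> k) & 0x1) << 4                 # link<4>:     1b of NI header
        phit |= ((pl0 >> k) & 0x1) << 5                # link<5>:     1b of PL0
        phit |= ((pl1 >> (8 * k)) & 0xFF) << 6         # link<13:6>:  8b of PL1
        phit |= ((pl2 >> (8 * k)) & 0xFF) << 14        # link<21:14>: 8b of PL2
        phits.append(phit)
    return phits

def unpack_phits(phits):
    """Rebuild the five packet fields from the phit sequence."""
    sw = ni = pl0 = pl1 = pl2 = 0
    for k, phit in enumerate(phits):
        sw |= (phit & 0xF) << (4 * k)
        ni |= ((phit >> 4) & 0x1) << k
        pl0 |= ((phit >> 5) & 0x1) << k
        pl1 |= ((phit >> 6) & 0xFF) << (8 * k)
        pl2 |= ((phit >> 14) & 0xFF) << (8 * k)
    return sw, ni, pl0, pl1, pl2
```

Because every field keeps the same link position in every phit, widening or narrowing one sub-field only changes its own bit range.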
Packet Format & Protocol (Cont’d)
Packet ordering
  Problem statement
    Packet delivery time depends on network conditions
    When multiple packets are issued, the packet sequence can become disordered
  Typical solution
    Packet sequence number ← Requires a large queue
  Proposed solution
    Uses fixed routing
    NI controls read-packets
[Figure: NIs A, B, and C connected through switches (SW), with packets C, B, B, B in flight]
Hold 'A' until all the packet transactions with 'B' complete
[3]
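The hold rule can be sketched as a small piece of NI bookkeeping. This is a minimal model, not the implemented NI logic. It assumes that fixed routing keeps packets to one destination in order, so the NI only has to delay a read to a new destination while reads to the current one are outstanding. The class and method names are hypothetical.

```python
from collections import deque

class ReadOrderingNI:
    """Holds reads to a new destination until the active destination drains."""

    def __init__(self):
        self.active_dest = None   # destination with outstanding reads
        self.outstanding = 0      # number of reads not yet returned
        self.held = deque()       # requests waiting, in issue order

    def request(self, dest):
        """Issue a read if allowed; return False when the request is held."""
        if self.held or (self.outstanding and dest != self.active_dest):
            self.held.append(dest)
            return False
        self.active_dest = dest
        self.outstanding += 1
        return True

    def complete(self):
        """One read returned; release held requests once nothing is outstanding."""
        self.outstanding -= 1
        released = []
        if self.outstanding == 0:
            # Release the oldest held request and any that follow to the same destination.
            while self.held and (not released or self.held[0] == released[0]):
                dest = self.held.popleft()
                self.active_dest = dest
                self.outstanding += 1
                released.append(dest)
        return released
```

No sequence numbers or reorder queue are needed in this model; the cost is that a read to a different destination waits for the earlier transactions to finish.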
Latency Reduction
Motivation
  OCN = set of switches and QBs
    Provides sufficient bandwidth, but suffers from large end-to-end latency
  Propagation time vs. switching time
    Switching time takes the larger portion
[Figure: delay time (ps) versus year (2002 to 2011) for Tswitching and Tprop between two switches]
[4]
Based on ITRS roadmap data
Latency Reduction (Cont’d)
Previous work (AEthereal*)
Contention-free routing
  Eliminates queuing and arbitration delay
  Switches use a slot table
  Guarantees min. bandwidth and max. latency
Limitations
  Initial packet transmission must wait for its time slot
  Requires latching at every switch
  Hard to apply to a mesochronous OCN
[Figure: slot-based routing of packets through SW 1, SW 2, and SW 3 in time slots 1, 2, 3]
[4]
*E. Rijpkema, et al., “Trade-offs in the Design of A router…”, DATE2003
Proposed Scheme
Concept
  Storing the packet route using an actual connection
[Figure: A and B connected through SW 1, SW 2, and SW 3 with an additional line]
[4]
Proposed Scheme
Concept
  Stores information on frequently used routes
[4]
Cached Arbitration
Cached-Arbitration Scheme
Operations
  AR-path connection / disconnection
  PR-path reservation / release
Design issue
  Synchronization issues on AR-path connection/disconnection & PR-path reservation/release
Design overhead
  4b AR-link, ARARB, ARSW
AR-path setup time
  2 x Tprop + Tcycle
  cf) Packet-based circuit path setup time = 2 x Tprop + N x Tcycle
[Figure: NIs A and B connected through switches by an AR-link (AR-path) and a PR-link (PR-path); blocks labeled ARSP, ARSF, ARARB, and FTR]
[4]
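The two setup-time expressions can be compared with a few lines of code. This sketch only restates the formulas from the slide. The meaning of N is not spelled out there, so the parameter `n` (presumably the number of switches on the path) and the function names are assumptions.

```python
def ar_path_setup_time(t_prop, t_cycle):
    """AR-path setup time: 2 x Tprop + Tcycle."""
    return 2 * t_prop + t_cycle

def packet_based_setup_time(t_prop, t_cycle, n):
    """Packet-based circuit path setup time: 2 x Tprop + N x Tcycle."""
    return 2 * t_prop + n * t_cycle

def setup_time_saving(t_cycle, n):
    """Difference between the two schemes: (N - 1) x Tcycle."""
    return (n - 1) * t_cycle
```

The propagation term is common to both, so the saving depends only on N and the cycle time.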
Summary
Basic architecture definition
  Topology
    ▪ Star, Mesh
    ▪ N, L
  Phit size
    ▪ OCS
    ▪ Optimal RSER
  Packet
    ▪ Aligned format
  Protocol
    ▪ Packet ordering
Advanced featuring
  Latency
    ▪ Cached Arbitration
Circuit-level
  Low-power/high-perf. schemes
Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
  [1] Mesochronous Communication
  [2] Wave-Front-Train: A New Serdes Scheme
  [3] Adaptive Bandwidth Control
Chip Implementations & Measurement Results
Conclusion
Mesochronous Communication
Mesochronous
  fCLK (transmitter) = fCLK (receiver)
  φ (transmitter) ≠ φ (receiver)
Issues in mesochronous signaling
  Unknown latch timing
  Possible failure due to metastability
Conventional solutions
  Pipeline synchronizer
  FIFO synchronizer
[Figure: input signal, clk, and latched signal waveforms; pipeline synchronizer built from cascaded D flip-flops and a FIFO synchronizer]
[1]
Programmable Delay Synchronizer
Motivation
  On-chip situations causing phase variation
    Packet-bypassing technique (physical distance variation)
    Supply voltage control (electrical distance variation)
    Network clock frequency variation (reference timing variation)
  “Phase variation” is unknown but quantized
My approach
  Programmable delay
    Memorize the appropriate delay for a given network mode
    Recall the delay value according to the network mode
[Figure: VD block on the input signal, controlled by registers that are indexed by network mode and loaded with phase information from a phase detector; the output is the synchronized signal]
[1]
Programmable Delay Synchronizer (Cont’d)
Calibration
  Input signal sampling @ 0, TD, 2TD
[Figure: main clock and delayed clocks (TD, 2TD); PDU with VD elements, encoder, and registers (Program, Enable, Mode, DONE, to switch)]
[Figure: system view with PU, NI, CMU, and a switch whose Packet<0> to Packet<3> and CLK inputs pass through SYNC blocks]
[1]
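The memorize-and-recall idea can be sketched as a per-mode register file. This is a behavioral illustration only. The encoding rule (counting the samples at 0, TD, and 2TD that were taken before the input edge arrived) is a hypothetical stand-in for the phase detector and encoder, and the class and method names are assumptions.

```python
class ProgrammableDelaySync:
    """Per-mode delay memory: calibrate once, recall on every mode switch."""

    def __init__(self):
        self.registers = {}  # network mode -> stored delay code

    @staticmethod
    def encode(samples):
        """Hypothetical encoder: number of samples (at 0, TD, 2TD) still reading 0."""
        return sum(1 for s in samples if s == 0)

    def calibrate(self, mode, samples):
        """Store the delay code derived from one calibration sampling."""
        self.registers[mode] = self.encode(samples)
        return self.registers[mode]

    def recall(self, mode):
        """Return the delay code memorized for this network mode."""
        return self.registers[mode]
```

Because the phase variation is quantized per network mode, a small table indexed by mode is enough, and no continuous phase tracking is required in this model.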
High-Speed Serdes
Conventional architecture
  Shift-register + MUX
Limitations
  Requires CLKSER (= RSER x CLK)
  Synchronization issue at the receiver
  Max. frequency is limited by TSETUP + TMUX + THOLD
[Figure: conventional serializer (D flip-flops with load MUXes, inputs D3 to D0, LOADb, output SOUT) and deserializer (shift register with enables, outputs Q3 to Q0)]
[2]
Proposed Serdes Scheme
Wave-Front-Train (WAFT) serdes concept
  Data separation is done by delay elements, not D-FFs
[Figure: three snapshots of a delay-element chain carrying the "1" pilot signal and D0 to D3 toward outputs Q3 to Q0, shown with the ENb and STOP control states (ENb = 1, STOP = 0; ENb = 0, STOP = 0; ENb = 0 with STOP)]
[2]
WAFT Serdes Operations
Schematic & timing diagrams
[Figure: serializer schematic (MUX and DE stages loaded with D3 to D0 and the pilot signal, nodes QS3 to QS0 and QP, MUXP, MUXO, buffer BUF driving SOUT) and deserializer schematic (DE stages with outputs Q3 to Q0 and STOP)]
[Figure: timing diagram of EN, SOUT, Q3 to Q0, and STOP, with TU = TMUX + TDE and annotated intervals TBUF + TU, TBUF + TU + 0.5 x TDE, and TWAVE]
[2]
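A discrete-step model helps to follow the wave-front-train idea: the pilot bit travels ahead of the data, and the deserializer stops when the pilot reaches the end of its delay chain. The sketch below is a behavioral abstraction in which one loop iteration stands for one unit delay TU. It is not a circuit model, the pilot-first ordering is an assumption read from the diagrams, and the function names are hypothetical.

```python
def waft_serialize(data_bits):
    """Emit the pilot '1' first, followed by D0, D1, ... as one wave train."""
    return [1] + list(data_bits)

def waft_deserialize(train, n_bits):
    """Shift the train through n_bits + 1 stages until the pilot reaches the last one."""
    chain = [0] * (n_bits + 1)           # chain[0] is the input side, chain[-1] the STOP stage
    for symbol in train:
        chain = [symbol] + chain[:-1]    # every stage passes its value on after one T_U
        if chain[-1] == 1:               # pilot arrived: STOP freezes the chain
            break
    return list(reversed(chain[:-1]))    # D0 sits next to the pilot, so reverse to D0..Dn-1
```

No serial clock appears in this model: the stop condition comes from the pilot itself, which mirrors the claim that data separation is done by delay elements rather than D-FFs.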
WAFT Deserializer
MUX vs. Latch
[Figure: MUX with inputs A and B, output C, and select signals SEL/SELb, compared with a latch with input A and output C driven by the same SEL/SELb signals]
[2]
Advantages of WAFT
High performance
  Achieves up to 4.3Gb/s @ 1.8V, 0.18μm CMOS tech.
  Conventional serdes: max. 2Gb/s under the same conditions
Low power consumption
  Power-consuming D-FFs are eliminated
  47% less power consumption @ 2Gb/s operation
Low system overhead
  CLKSER is not required → No burden on the clock generation circuit
[2]
VDD Sensitivity of WAFT
Effects of supply voltage (VDD)
  Jitter at deserializer data if TU(ser) ≠ TU(des)
[Figure: transmitter (VDD1) and receiver (VDD2) delay chains at T = 0, 1, 2, ..., n, with delay TD depending on VDD]
[2]
VDD Sensitivity of WAFT (Cont’d)
Effects of supply voltage (VDD) (Cont’d)
V(w0,t) = V(w1,t) = … = V(wn,t) → No jitter
[Figure: transmitter and receiver under uniform VDD at T = 0, 1, ..., n-1, n, with delays TD, TD', TD'']
[2]
Solution for VDD Variation
VDD-insensitive delay
  Current-starving logic
    Constant VGS guarantees
    Low VGS: stable but slow
    High VGS: unstable but fast
[Figure: current-starving inverter biased by VREFP and VREFN]
[Figure: VGS vs. delay time: TWAVE (ns) versus VGS (0.55 to 0.85) for VDD = 1.62V, 1.80V, 1.98V; low-VGS and high-VGS regions; ΔVGS for zero jitter; jitter at +10% ΔVDD]
[2]
Adaptive VGS Control
VDD-dependent VREF generation
[Figure: VREF generator with branch currents IM1, IM2, IM3, IM4]
[Figure: current profile according to VDD variation: I (mA) versus VDD (0.5 to 2.0 V) for IM1, IM2, IM3, with IM4 = IM1 - IM2 - IM3; VDD = 1.62~1.98V]
[Figure: ideal VGS vs. generated VREF: VREFN (mV, 600 to 880) versus VDD (1.6 to 2.0 V); curves: ideal VGS for zero jitter and VREFN curve of the ref. generator]
[2]
Link Energy-Consumption Reduction
Adaptive bandwidth control
  Motivation
    Max. BW = 0.35 / TR
    TR depends on supply voltage
    When less bandwidth is required, the supply voltage of a link can be reduced to save energy
  Implementation issues
    The operating frequency of the serializer must change according to the supply voltage variation
[Figure: BW (Gb/s) versus supply voltage (1.2 to 1.8 V) for BMAX and BOUT; data rate of serializer & link and data rate at the switch output (0.8Gb/s, 1.6Gb/s)]
Using a WAFT serdes, the output signal bandwidth changes automatically according to VDD.
BMAX = max. affordable BW
BOUT = output signal BW
[3]
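The bandwidth relation Max. BW = 0.35 / TR can be evaluated in both directions. The sketch below only restates that formula; the function names are hypothetical and TR is taken in seconds.

```python
def max_bandwidth(t_r):
    """Maximum affordable bandwidth for a given TR: 0.35 / TR."""
    return 0.35 / t_r

def required_t_r(bandwidth):
    """Largest TR that still supports the requested bandwidth."""
    return 0.35 / bandwidth
```

For instance, a 1.6 Gb/s link tolerates a TR of roughly 219 ps under this relation, while a 0.8 Gb/s link tolerates twice that, which is the slack that a lower supply voltage can use.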
Link Energy-Consumption Reduction (Cont’d)
Energy-consumption
[Figure: dual VDD driving buffer (VDD, LVDD, LVEN, LVENb, IN, OUT)]
[Figure: energy-consumption graph per phit]
  ETYP
  ELV = 0.36 ETYP
  EHV ~ ETYP
  EL2H = 1.42 ETYP
  EH2L ~ ETYP
If N(phit) ≥ 4 @ LV, transition overhead is compensated
[3]
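The break-even point of four phits can be reproduced with a simple energy model. The sketch assumes that a low-voltage run of n phits costs one high-to-low transition, n low-voltage phits, and one low-to-high transition, all treated as overhead on top of the phit transfers, and compares that with n phits at the typical energy. This accounting and the function names are assumptions; the coefficients are the ones given above, with EH2L taken as exactly ETYP.

```python
import math

E_TYP = 1.0            # energy per phit at nominal voltage (normalized)
E_LV = 0.36 * E_TYP    # energy per phit at low voltage
E_L2H = 1.42 * E_TYP   # low-to-high transition energy
E_H2L = 1.0 * E_TYP    # high-to-low transition energy (given as ~ETYP)

def low_voltage_energy(n_phits):
    """Energy of a low-voltage run, including both voltage transitions."""
    return E_H2L + n_phits * E_LV + E_L2H

def break_even_phits():
    """Smallest run length for which the low-voltage mode saves energy."""
    return math.ceil((E_H2L + E_L2H) / (E_TYP - E_LV))
```

Under these assumptions a run of three phits still costs more than staying at nominal voltage, while four phits is enough to recover the transition overhead.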
Summary
Basic architecture definition
  Topology
    ▪ Star, Mesh
    ▪ N, L
  Phit size
    ▪ OCS
    ▪ Optimal RSER
  Packet
    ▪ Aligned format
  Protocol
    ▪ Packet ordering
Advanced featuring
  Latency
    ▪ Cached Arbitration
Energy-efficient circuits
  ▪ WAFT serdes
  ▪ Programmable sync
  ▪ Adaptive BW control
Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
  1st chip implementation & measurement results
  2nd chip implementation & measurement results
Conclusion
1st Chip Implementation
Modeling the SoC environment of non-synchronized heterogeneous PUs
  Each clock source can be driven independently by external equipment
  Every component is placed as far apart from the others as possible
  Clock distribution is done manually without consideration of the clock skew problem

Power       264mW @ 800MHz, 2.3V
Tr. count   81,000 (excluding 1kB SRAM)
Tech.       0.38μm CMOS technology
Die size    10.8 x 6.0 mm2
Measurement Results
Single chip operation
[Figure: block diagram (PU1, NI, SW, packet SRAM; UPS1/DNS1, UPS2/DNS2) and measured waveforms of CLKNET (300MHz), CLKSRAM (75MHz), CLKPU1 (75MHz), DATASTB, and OENSRAM, showing 3 packets, 3 SRAM data, and 12 cycles]
Chip-to-chip operation
[Figure: chip A and chip B (each with PLL, SW, NI, SRAM, UPS/DNS, and OGWA or OGWB) sharing a bus, TOKEN, and CLKREF; measured waveforms of CLKOGW, OENOGW, OENSRAM, and TKN_PASS, showing a chip A-to-B packet transaction, SRAM read, OGWB grabbing the token, and a chip B-to-A packet transaction]
2nd Chip Implementation
Overall architecture
  Two-level hierarchical star topology (1 level-2, 4 level-1 switches)
  3 masters (traffic generators), 1 SRAM (256b)
  400MHz operation
Implemented features
  Burst-packet bypassing
  WAFT serdes
  Programmable delay SYNC
  Adaptive bandwidth control

Max. Freq.  400MHz (1.6Gb/s) @ 1.8V
Tr. count   409k
Tech.       0.18μm CMOS technology
Die size    4.0 x 4.0 mm2
Switch Layout
[Figure: switch layout, 1200um x 1000um: Arbiter, Switch Fabric, Inport 1 to 5, Outport 1 to 5, SER1, SER2, DES1, DES2]
Switch area breakdown
Block name       Size                   Portion
Switch fabric*   500 x 530um2           31%
Arbiters         400 x 170um2           8%
Input FIFO       320 x 136um2 x 5eu     25%
Output port      180 x 140um2 x 5eu     15%
Serializer       320 x 130um2 x 2eu     10%
Deserializer     18m x 280um2 x 2eu     12%
* 1 port = 44b (2:1 serialized)
Measurement Results
8:1 WAFT serialized output
[Figure: measured serialized waveforms with pilot signals; labeled patterns include 0101010111010101, 10101010, “11111110”, “11111111”, and “00000000”]
Measurement Results (Cont’d)
Eye diagram
[Figure: eye diagram of D0 to D7 with the pilot signal and clock; 250ps scale]
Measurement Results (Cont’d)
Deserializer jitter
                              No noise   Noise
Typical WAFT                  16.91ps    33.00ps
WAFT with low-jitter scheme   17.07ps    23.70ps
Measurement Results (Cont’d)
Adaptive bandwidth control
[Figure: SOUT and CLK waveforms for typical WAFT and with low-bandwidth enabled]
Measurement Results (Cont’d)
Programmable delay synchronizer
  Problem statement
[Figure: CLK and EN waveforms in modes A, B, and C; annotations: Unstable, 1-clock delayed fetch]
Measurement Results (Cont’d)
Programmable delay synchronizer
  Problem statement (cont’d)
[Figure: CLK and EN waveforms with intervals TA and TB; annotation: 1-clock delay fetch]
Measurement Results (Cont’d)
Programmable delay synchronizer
  With programmable delay synchronizer
[Figure: CLK and EN|mode C waveforms without and with the programmable synchronizer; annotation: Extension]
Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions
Conclusions
Cost-effective OCN architecture is defined
  Star vs. Mesh comparison
  Packet format and protocols considering on-chip SoC situations
Cost optimization is performed
  OCS for area/energy reduction
Advanced featuring is performed
  CAS for latency reduction
  WAFT for low-power/high-performance serdes
  Efficient synchronizer using programmable delay
  Adaptive bandwidth control for link energy-consumption reduction
OCN chips are implemented
Real On-Chip Network is realized