Cost-Optimization and Chip Implementation of On-Chip Network
Doctoral Thesis Presentation
이세중 (Se-Joong Lee), 2005. 5. 17
Semiconductor System Lab, Dept. of EE, KAIST
Source: ssl.kaist.ac.kr/2007/data/thesis/Se-Joong_Lee_PhD.pdf


Ph.D. Thesis Presentation, Se-Joong Lee 2

Outline

Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions


Introduction

System-on-Chip design
  Complexity of interconnect
    Interconnects more than 10 processors
    Requires a few GB/s of on-chip traffic
    Communication channel is the bottleneck
  Uncertain PHY parameters
    < 100nm era
    Prone to process variations
  Other issues
    Clock synchronization @ ~GHz

On-Chip Network

[Figures: CELL processor; 60nm device image with labels Oxide Spacer, FG, CG, W and a 200 nm scale]


Introduction (Cont’d)

[Diagram: stages of OCN research, marked as either previous works or contribution of this work]
  OCN concept
  Basic architecture definition
  Architecture optimization
  Advanced featuring
  Chip implementation


Background

Research Issues on OCN

  Network topology
  Protocol / packet format
  Routing scheme
  Switching scheme
  Flow control

[Diagram labels: processing unit (node), switch, queuing buffer, packet]


Background (Cont’d)

Active working groups & results
  Stanford – Bologna
    Netchip project = SUNMAP + XpipesCompiler
    Topology configuration, automated SystemC model synthesis
  Royal Institute of Technology
    Nostrum project
    Protocol, application-to-NoC mapping
  Philips
    AEthereal project
    QoS-guaranteed OCN
  Princeton
    Power modeling and simulation

Limitations of the previous works
  → Lack of consideration of implementation issues


Background (Cont’d)

Chip implementation cases
  Pleiades
    UC Berkeley, ISSCC2000
    Statically configured mesh-like interconnect
    No concept of network
  Programmable Switch
    ST Microelectronics, ISSCC2003
    Programmable switch using a Flash memory structure
  RAW
    MIT, ISSCC2003
    Mesh network for 16 processors
    No dedicated research on the network architecture

[Figures: MIT’s RAW; Berkeley’s Pleiades; STM’s On-Chip Switch]


Research Topics in This Work

1. Network topology
2. Channel width
3. Packet format/protocol
4. Latency
5. Serdes
6. Low-power link
7. Mesochronous communication


Introduction
Background
Cost-Optimized OCN Architecture
  [1] Topology Selection
  [2] On-Chip Serialization
  [3] Packet format/protocol
  [4] Latency Reduction
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusion


Topology Selection

General ideas on topology selection
  Traffic pattern – topology
    Uniformly distributed traffic → mesh-based topology
    Localized traffic → hierarchical topology (e.g., tree, hierarchical mesh)
  Topology selection within a level
    Mesh vs. star, according to
      Link length
      # of nodes

[Figures: tree and hierarchical star topologies; hierarchical star topology with clusters, local network, and global network, asking "Mesh or Star?" within each level]

[1]


Topology Cost Comparison

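Area equations (approximated)

$A_M = \mathrm{area}(\mathrm{Mesh}) = (25N - 36\sqrt{N}) \cdot A_{SW} + (5N - 4\sqrt{N}) \cdot A_{QB} + 4(N - \sqrt{N}) \cdot A_{LK}$
  (terms: $A_{M\_SW}$, $A_{M\_QB}$, $A_{M\_LK}$)

$A_S = \mathrm{area}(\mathrm{Star}) = N^2 \cdot A_{SW} + N \cdot A_{QB} + N\sqrt{N} \cdot A_{LK}$
  (terms: $A_{S\_SW}$, $A_{S\_QB}$, $A_{S\_LK}$)

$N \le 16 \text{ and } A_{LK} \le 3 A_{QB} \;\rightarrow\; (A_{S\_LK} + A_{S\_QB}) \le (A_{M\_LK} + A_{M\_QB})$

$N \le 16 \;\rightarrow\; A_{S\_SW} \le A_{M\_SW}$

Rules of thumb

$N \le 16 \text{ and } A_{LK} \le 3 \cdot A_{QB} \;\rightarrow\; A_S \le A_M$

$N \le 16 \text{ and } E_{LK} \le 2 \cdot E_{QB} \;\rightarrow\; E_S \le E_M$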

[1]
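The area rule of thumb can be checked numerically with a small sketch. It assumes the approximated equations read $A_M = (25N - 36\sqrt{N}) A_{SW} + (5N - 4\sqrt{N}) A_{QB} + 4(N - \sqrt{N}) A_{LK}$ and $A_S = N^2 A_{SW} + N A_{QB} + N\sqrt{N} A_{LK}$; the square roots are a reconstruction, and the unit areas below are hypothetical values chosen only to sit on the boundary $A_{LK} = 3 A_{QB}$.

```python
import math

def mesh_area(n, a_sw, a_qb, a_lk):
    """Approximate mesh area for n nodes (n is a perfect square)."""
    r = math.sqrt(n)
    return (25 * n - 36 * r) * a_sw + (5 * n - 4 * r) * a_qb + 4 * (n - r) * a_lk

def star_area(n, a_sw, a_qb, a_lk):
    """Approximate star area for n nodes (reconstructed link term n*sqrt(n))."""
    return n ** 2 * a_sw + n * a_qb + n * math.sqrt(n) * a_lk

# Hypothetical unit areas; a_lk = 3 * a_qb is the boundary of the rule of thumb.
A_SW, A_QB = 1.0, 1.0
A_LK = 3.0 * A_QB

for n in (4, 9, 16, 25):
    a_s = star_area(n, A_SW, A_QB, A_LK)
    a_m = mesh_area(n, A_SW, A_QB, A_LK)
    print(f"N={n:2d}  A_S={a_s:7.1f}  A_M={a_m:7.1f}  star cheaper or equal: {a_s <= a_m}")
```

Under these assumptions the two areas meet at N = 16 and the star becomes larger at N = 25, which matches the N ≤ 16 condition stated on the slide.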


Topology Cost Comparison (Cont’d)

Numerical analysis
  Based on actual implementation results

[Plot: area ratio AS / AM (0.3 to 1.0) vs. N (4, 9, 16, 25) for L = 0.5, 1.0, 1.5, 2.0 mm]
[Plot: energy ratio ES / EM (0.7 to 1.2) vs. N (4, 9, 16, 25) for L = 0.5, 1.0, 1.5, 2.0 mm]

[1]


Phit Size Determination

Wide vs. narrow phit size
  Wide phit size
    Realizes high bandwidth easily
    Not friendly in terms of P&R
    Possibly high energy consumption due to large building blocks
  Small phit size (serialized link)
    Realizes a small OCN footprint
    Bandwidth limitation
    Possibly high energy consumption due to serdes and high clock frequency

No previous work on phit size determination
→ Optimal phit size determination is required


On-Chip Serialization (OCS)

Effect on links
  Link area (ALK) is reduced
  ELK = EWIR + EDRV
    Wider spacing → coupling capacitance reduction → EWIR ↓
    Higher frequency → smaller tapering factor → EDRV ↑
    EWIR > EDRV, therefore ELK ↓

[Plot: link energy ratio (0.4 to 1.0) vs. RSER (1, 2, 4, 8) with OCS]

[2]


On-Chip Serialization (OCS) (Cont’d)

Effects on links
  Switching activity
    No encoding: up to 3x for a linearly increasing data pattern
    Encoded serial link: <10% overhead

[Bar chart: parallel/serial bus transitions for memory accesses of 3D operations; groups INST_ADDR, INST_DATA, DATA_ADDR, DATA_DATA; series: parallel bus, serial bus, encoded serial; y-axis 0 to 3.5]

[2]
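How serialization changes switching activity can be explored with a short counting sketch. It is only an illustration: it serializes each word fully onto one wire, LSB first (both are assumptions, the slides use partial ratios such as 4:1), and it does not model the encoding scheme, which the slides do not specify.

```python
def parallel_transitions(words, width):
    """Count bit toggles on a width-bit parallel bus across consecutive words."""
    mask = (1 << width) - 1
    return sum(bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:]))

def serial_transitions(words, width):
    """Count toggles on a single wire that carries each word LSB first."""
    bits = [(w >> i) & 1 for w in words for i in range(width)]
    return sum(a != b for a, b in zip(bits, bits[1:]))

# Linearly increasing data pattern, e.g. sequential addresses.
counter = list(range(256))
p = parallel_transitions(counter, 8)
s = serial_transitions(counter, 8)
print(f"parallel toggles: {p}, serial toggles: {s}, ratio: {s / p:.2f}")
```

The ratio printed for a given trace depends on the data pattern and the word width, so it should be read as a trend indicator rather than a reproduction of the measured chart.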


On-Chip Serialization (OCS) (Cont’d)

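Effects on switches
  Switch energy: ESF + EARB
    Arbiter operates on a packet basis → OCS does not affect it
    Switch fabric area/energy is reduced

[Figure: switch fabric for phit width W (without OCS) and W/RSER (with OCS); ports labeled P; CJ = ΣCJi, CC = ΣCCi]

Without OCS: $E_{SF} = W \times (C_C + C_J)$

With OCS: $E_{SF} = R_{SER} \times \frac{W}{R_{SER}} \times \left( \frac{C_C}{R_{SER}} + C_J \right) = W \times \left( \frac{C_C}{R_{SER}} + C_J \right)$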

[2]
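The switch-fabric expression can be evaluated directly. The sketch below assumes the serialized form $E_{SF} = R_{SER} \times (W / R_{SER}) \times (C_C / R_{SER} + C_J)$; the split between CC and CJ (`cc_fraction = 0.92`) is a hypothetical value fitted by hand, not a number taken from the slides.

```python
def sf_energy(width, c_c, c_j, r_ser=1):
    """Switch-fabric energy: R_SER * (W / R_SER) * (C_C / R_SER + C_J)."""
    return r_ser * (width / r_ser) * (c_c / r_ser + c_j)

def sf_energy_ratio(r_ser, cc_fraction):
    """Energy with OCS relative to no OCS, for a given C_C share of the total capacitance."""
    c_c, c_j = cc_fraction, 1.0 - cc_fraction
    return sf_energy(1.0, c_c, c_j, r_ser) / sf_energy(1.0, c_c, c_j, 1)

# Hypothetical capacitance split; chosen by hand, not given in the source.
CC_FRACTION = 0.92
for r in (1, 2, 4, 8):
    print(f"RSER={r}: E_SF ratio = {sf_energy_ratio(r, CC_FRACTION):.2f}")
```

With this assumed split the ratios come out near 1.00, 0.54, 0.31, and 0.20, close to the SF row of the energy transfer matrix TME shown later in the presentation.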


On-Chip Serialization (OCS) (Cont’d)

Effects on queuing buffers
  Bitline capacitance in an SRAM-type QB

[Figure: dual-port SRAM queuing buffer (ADDR GEN, WLD, PRE, WD, cells, BLSA) for RSER = 1, RSER = 2, and RSER = 4, 8, 16; array labels 88/8, 44/16, 22/32]
[Plot: QB energy ratio (0.5 to 1.6) vs. RSER (1, 2, 4, 8)]

[2]


Transfer & Characteristic Matrix

Transfer matrix
  Transformation representing the OCS operation
Characteristic matrix
  Matrix representing the area/energy breakdown of an OCN

TME =
          1:1    2:1    4:1    8:1
  Link    1.00   0.75   0.64   0.80
  QB      1.00   1.13   1.30   1.60
  SF      1.00   0.54   0.32   0.20
  Other   1.00   1.00   1.00   1.00

TMA =
          1:1    2:1    4:1    8:1
  Link    1.00   0.75   0.5    0.31
  QB      1.00   1.00   1.00   1.00
  SF      1.00   0.25   0.06   0.02
  Other   1.00   1.00   1.00   1.00

Characteristic matrices (labels on the slide: CMME, CMSE, CMMA, CMSA)

          Link    QB     SF     Other
  N= 4    1.60L   0.33   0.26   0.01
  N= 9    2.13L   0.33   0.58   0.01
  N=16    3.20L   0.33   1.03   0.01
  N=25    3.84L   0.33   1.61   0.01

          Link    QB     SF     Other
  N= 4    1.60L   0.99   0.60   0.03
  N= 9    2.13L   1.21   0.92   0.03
  N=16    2.40L   1.32   1.09   0.03
  N=25    2.56L   1.39   1.20   0.04

          Link    QB     SF     Other
  N= 4    0.72L   0.27   0.33   0.02
  N= 9    1.27L   0.27   0.75   0.04
  N=16    1.79L   0.27   1.33   0.06
  N=25    2.29L   0.27   2.08   0.10

          Link    QB     SF     Other
  N= 4    0.48L   0.32   0.50   0.04
  N= 9    0.85L   0.56   0.87   0.06
  N=16    1.19L   0.77   1.20   0.07
  N=25    1.53L   0.97   1.52   0.08

$CM' = CM \times TM$

[2]
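The product CM' = CM × TM can be sketched as a plain matrix multiplication followed by a minimum search over the serialization ratios. The sketch pairs the first characteristic matrix listed on the slide with TME purely for illustration; which characteristic matrix belongs to which label is an assumption here, as is L = 1 (the link column is written as a multiple of L).

```python
RATIOS = ["1:1", "2:1", "4:1", "8:1"]

# Energy transfer matrix TME. Rows: Link, QB, SF, Other. Columns: RATIOS.
TM_E = [
    [1.00, 0.75, 0.64, 0.80],
    [1.00, 1.13, 1.30, 1.60],
    [1.00, 0.54, 0.32, 0.20],
    [1.00, 1.00, 1.00, 1.00],
]

L = 1.0  # assumed link-length factor
# First characteristic matrix on the slide. Columns: Link, QB, SF, Other.
CM = {
    4:  [1.60 * L, 0.33, 0.26, 0.01],
    9:  [2.13 * L, 0.33, 0.58, 0.01],
    16: [3.20 * L, 0.33, 1.03, 0.01],
    25: [3.84 * L, 0.33, 1.61, 0.01],
}

def apply_ocs(cm_row, tm):
    """One row of CM' = CM x TM: total cost for each serialization ratio."""
    return [sum(c * tm[k][j] for k, c in enumerate(cm_row)) for j in range(len(tm[0]))]

def best_ratio(cm_row, tm):
    """Serialization ratio with the lowest total cost."""
    totals = apply_ocs(cm_row, tm)
    return RATIOS[totals.index(min(totals))]

for n, row in CM.items():
    print(n, [round(t, 2) for t in apply_ocs(row, TM_E)], best_ratio(row, TM_E))
```

Under this pairing the minimum falls at 4:1 for every N, which agrees with the optimal RSER reported in the presentation, but the pairing itself remains an assumption.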


Optimal RSER Determination

Analysis setup
  Analyze AM, AS, EM, ES according to N (4 to 25)
  1:1 to 8:1 serialization
  Default phit size = 88b
  Distance between PUs = 1mm
  QB = 3-packet capacity

[Plots: mesh energy EM and star energy ES (x10^-10 J), mesh area AM and star area AS (x10^5 μm^2) vs. RSER (1, 2, 4, 8) for N = 4, 9, 16, 25, each bar broken down into Link, QB, SF, Others]

Optimal RSER = 4:1 OCS (or 22b phit size)

[2]


OCS-applied Topology Comparison

[Plots: AS / AM (0.3 to 1.0) and ES / EM (0.7 to 1.2) vs. N (4, 9, 16, 25) for L = 0.5, 1.0, 1.5, 2.0 mm; two sets of panels annotated "Area Reduction" and "Energy Reduction"]

[2]


Packet Format & Protocol

Typical vs. proposed packet format
  Typical format
    Layered architecture
    Phit, flit, and packet are defined independently
  Proposed format
    Aligned architecture
    Phit and packet formats are tightly correlated
    → Low decoding overhead (50% power reduction)
    → Easy to scale sub-fields of a packet

[Figure: typical format: a packet divided into Flit 1, Flit 2, Flit 3, ..., Flit n, each with a flit header]
[Figure: proposed format: 88b packet = SW header (16b), NI header (4b), PL0 (4b), PL1 (32b), PL2 (32b); link bits link<3:0> (4b), link<4> (1b), link<5> (1b), link<13:6> (8b), link<21:14> (8b)]

[3]
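The aligned format can be illustrated by packing an 88b packet into four 22b phits so that each field always occupies the same link bits. This is a sketch only: the field-to-link-bit mapping follows the order of the labels on the slide, and sending the least significant slice of each field first is an assumption.

```python
PHITS_PER_PACKET = 4  # 88b packet over a 22b link (4:1 serialization)

# (field name, bits in the packet, LSB position on the link); mapping assumed from slide order.
FIELDS = [
    ("sw_header", 16, 0),   # link<3:0>
    ("ni_header", 4, 4),    # link<4>
    ("pl0", 4, 5),          # link<5>
    ("pl1", 32, 6),         # link<13:6>
    ("pl2", 32, 14),        # link<21:14>
]

def pack(packet):
    """Split every field into equal slices and place one slice per phit."""
    phits = []
    for i in range(PHITS_PER_PACKET):
        word = 0
        for name, bits, lsb in FIELDS:
            per_phit = bits // PHITS_PER_PACKET
            chunk = (packet[name] >> (i * per_phit)) & ((1 << per_phit) - 1)
            word |= chunk << lsb
        phits.append(word)
    return phits

def unpack(phits):
    """Reassemble the fields; each one is read from fixed link bits, no flit-header decoding."""
    packet = {name: 0 for name, _, _ in FIELDS}
    for i, word in enumerate(phits):
        for name, bits, lsb in FIELDS:
            per_phit = bits // PHITS_PER_PACKET
            chunk = (word >> lsb) & ((1 << per_phit) - 1)
            packet[name] |= chunk << (i * per_phit)
    return packet

example = {"sw_header": 0xBEEF, "ni_header": 0xA, "pl0": 0x5,
           "pl1": 0xDEADBEEF, "pl2": 0x12345678}
print([hex(p) for p in pack(example)])
```

Because a switch only needs link<3:0> in this layout, it can read its header slice from every phit without parsing the rest, which is one way to picture the low decoding overhead claimed for the aligned format.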


Packet Format & Protocol (Cont’d)

Packet ordering
  Problem statement
    Packet delivery time depends on network conditions
    When multiple packets are issued, the packet sequence could be disordered
  Typical solution
    Packet sequence number ← requires a large queue
  Proposed solution
    Uses fixed routing
    NI controls read packets

[Figure: NIs A, B, and C connected through switches (SW); the issuing NI holds 'A' until all the packet transactions with 'B' complete]

[3]
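One way to picture the NI-controlled ordering is a toy model in which the NI sends reads only to one destination at a time and holds a read to a different destination until the outstanding ones complete. The class and method names are hypothetical, and the policy is an interpretation of the slide rather than the documented implementation.

```python
from collections import deque

class OrderingNI:
    """Toy network interface that keeps read responses in issue order."""

    def __init__(self):
        self.current_dst = None   # destination that has reads in flight
        self.outstanding = 0      # number of reads awaiting a response
        self.held = deque()       # reads waiting for the in-flight ones to finish

    def issue_read(self, dst):
        """Return True if the read is sent now, False if it is held."""
        same_stream = self.outstanding == 0 or dst == self.current_dst
        if same_stream and not self.held:
            self.current_dst = dst
            self.outstanding += 1
            return True
        self.held.append(dst)
        return False

    def response_arrived(self):
        """Account for one response and release held reads that are now safe."""
        self.outstanding -= 1
        released = []
        while self.held and (self.outstanding == 0 or self.held[0] == self.current_dst):
            dst = self.held.popleft()
            self.current_dst = dst
            self.outstanding += 1
            released.append(dst)
        return released

ni = OrderingNI()
print(ni.issue_read("B"), ni.issue_read("B"), ni.issue_read("A"))
print(ni.response_arrived(), ni.response_arrived())
```

With fixed routing, reads to the same destination cannot overtake each other, so holding only at destination changes avoids sequence numbers and the reorder queue they would require.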


Latency Reduction

Motivation
  OCN = set of switches and QBs
    Provides sufficient bandwidth, but suffers from large end-to-end latency
  Propagation time vs. switching time
    Switching time takes the larger portion

[Figure: two switches (SW) joined by a link, annotated with Tswitching and Tprop]
[Plot: delay time (psec, 0 to 1400) vs. year (2002 to 2011) for Tswitching and Tprop]

[4]

Based on ITRS roadmap data


Latency Reduction (Cont’d)

Previous work (AEthereal*)
  Contention-free routing
    Eliminates queuing and arbitration delay
    Switches use a slot table
    Guarantees min. bandwidth and max. latency
  Limitations
    Initial packet transmission has to wait for its time slot
    Requires latching at every switch
    Hard to apply to a mesochronous OCN

[Figure: packets A, B, and P passing through SW 1, SW 2, SW 3 in time slots 1, 2, 3]

[4]

*E. Rijpkema, et al., “Trade-offs in the Design of A router…”, DATE2003


Proposed Scheme

Concept
  Storing the packet route using an actual connection

[Figure: route from A to B through SW 1, SW 2, SW 3, with an additional line]

[4]


Proposed Scheme

Concept
  Stores information of frequently used routes
  → Cached Arbitration

[Figure: route from A to B through SW 1, SW 2, SW 3, with an additional line]

[4]


Cached-Arbitration Scheme

Operations
  AR-path connection / disconnection
  PR-path reservation / release
Design issue
  Synchronization of AR-path connection/disconnection and PR-path reservation/release
Design overhead
  4b AR-link, ARARB, ARSW
AR-path setup time
  2 x Tprop + Tcycle
  cf.) Packet-based circuit path setup time = 2 x Tprop + N x Tcycle

[Figure: NIs A and B connected through switches (SW) by an AR-link/AR-path and a PR-link/PR-path; block and signal labels: ARRACKDISC, ARSP, ARSF, ARARB, FTR]

[4]
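The two setup-time expressions can be compared directly. In this sketch, N is taken to be the number of cycles the packet-based setup spends along the path (for example one per switch), which is an assumption, and the timing values are hypothetical.

```python
def ar_path_setup_time(t_prop, t_cycle):
    """AR-path setup time: 2 x Tprop + Tcycle."""
    return 2 * t_prop + t_cycle

def packet_circuit_setup_time(t_prop, t_cycle, n):
    """Packet-based circuit path setup time: 2 x Tprop + N x Tcycle."""
    return 2 * t_prop + n * t_cycle

# Hypothetical values in nanoseconds; not measurements from the chips.
T_PROP, T_CYCLE, N = 0.5, 2.5, 3
print("AR-path:", ar_path_setup_time(T_PROP, T_CYCLE), "ns")
print("packet-based:", packet_circuit_setup_time(T_PROP, T_CYCLE, N), "ns")
```

The two expressions coincide for N = 1, and the gap grows by one Tcycle for each additional unit of N.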


Summary

Basic architecture definition
  Topology
    ▪ Star, Mesh
    ▪ N, L
  Phit size
    ▪ OCS
    ▪ Optimal RSER
  Packet
    ▪ Aligned format
  Protocol
    ▪ Packet ordering
Advanced featuring
  Latency
    ▪ Cached Arbitration
  Circuit-level
    ▪ Low-power/high-performance schemes


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
  [1] Mesochronous Communication
  [2] Wave-Front-Train: A New Serdes Scheme
  [3] Adaptive Bandwidth Control
Chip Implementations & Measurement Results
Conclusion


Mesochronous Communication

Mesochronous
  fCLK (transmitter) = fCLK (receiver)
  φ (transmitter) ≠ φ (receiver)
Issues in mesochronous signaling
  Unknown latch timing
  Possible failure due to metastability
Conventional solutions
  Pipeline synchronizer
  FIFO synchronizer

[Figures: timing of input signal, clk, and latched signal; pipeline synchronizer (two cascaded D flip-flops); FIFO synchronizer]

[1]


Programmable Delay Synchronizer

Motivation
  On-chip situations causing phase variation
    Packet-bypassing technique (physical distance variation)
    Supply voltage control (electrical distance variation)
    Network clock frequency variation (reference timing variation)
  “Phase variation” is unknown but quantized
My approach
  Programmable delay
    Memorize the appropriate delay for a given network mode
    Recall the delay value according to the network mode

[Figure: VD block between the input signal and the synchronized output; a phase detector supplies phase information to registers selected by the network mode]

[1]


Programmable Delay Synchronizer (Cont’d)

Calibration
  Input signal sampling @ 0, TD, 2TD

[Figure: main clock and delayed clocks (TD, 2TD) sampling the input, with a "Void" region marked; PDU built from VD and D blocks; encoder output stored in registers (Program Enable, Mode), with DONE sent to the switch]
[Figure: SYNC blocks on D<0> to D<3> / Packet<0> to Packet<3> and CLK between PU, NI, and Switch; CMU]

[1]
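The memorize-and-recall idea can be sketched as a small model: calibration picks one of the delay steps 0, TD, 2TD for the current network mode and stores it in a register, and normal operation looks it up again. The selection rule used here (keep the sampling instant as far as possible from the data transition) and all names and numbers are assumptions; the slides only state that the input is sampled at 0, TD, and 2TD.

```python
class ProgrammableDelaySync:
    """Toy model of a synchronizer that stores one delay setting per network mode."""

    def __init__(self, t_clk, t_d, steps=(0, 1, 2)):
        self.t_clk = t_clk        # clock period
        self.t_d = t_d            # unit delay TD
        self.steps = steps        # candidate delays: 0, TD, 2TD
        self.registers = {}       # network mode -> chosen delay step

    def _margin(self, transition_phase, step):
        """Distance within one period between the sampling instant and the data transition."""
        d = (step * self.t_d - transition_phase) % self.t_clk
        return min(d, self.t_clk - d)

    def calibrate(self, mode, transition_phase):
        """Memorize the delay step with the largest timing margin for this mode."""
        best = max(self.steps, key=lambda s: self._margin(transition_phase, s))
        self.registers[mode] = best
        return best

    def delay_for(self, mode):
        """Recall the stored delay for a previously calibrated mode."""
        return self.registers[mode] * self.t_d

# Hypothetical numbers in ns: 400MHz clock, 0.6ns unit delay.
sync = ProgrammableDelaySync(t_clk=2.5, t_d=0.6)
sync.calibrate("A", transition_phase=0.0)
sync.calibrate("B", transition_phase=1.2)
print(sync.delay_for("A"), sync.delay_for("B"))
```

Because the phase variation is quantized by the network mode, one stored value per mode is enough, and recalling it avoids repeating the calibration at every mode switch.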


High-Speed Serdes

Conventional architecture
  Shift register + MUX
Limitations
  Requires CLKSER (= RSER x CLK)
  Synchronization issue at the receiver
  Max. frequency is limited by TSETUP + TMUX + THOLD

[Figure: conventional serializer and deserializer built from D flip-flops and 2:1 MUXes; labels D3 to D0, Q3 to Q0, CLK, LOADb, EN, SOUT]

[2]


Proposed Serdes Scheme

Wave-Front-Train (WAFT) serdes concept
  Data separation is done by delay elements, not D-FFs

[Figure: three snapshots of the serializer and deserializer delay chains carrying D0 to D3 behind a "1" (pilot signal), with "0" elsewhere and outputs Q3 to Q0; snapshot states: ENb (=1), STOP (=0); ENb (=0), STOP (=0); ENb (=0), STOP]

[2]


WAFT Serdes Operations

Schematic & timing diagrams

[Schematic: serializer: chain of 2:1 MUXes and delay elements (DE) loading D3 to D0 and a VDD-tied pilot signal under EN, with nodes QS3 to QS0, QP, MUXP, MUXO and a buffer (BUF) driving SOUT; deserializer: chain of MUXes and DEs with outputs Q3 to Q0, a /2 block, and STOP]
[Timing diagram: EN, SOUT (pilot, D0, D1, D2, D3), STOP, and Q3 to Q0; successive waves are spaced by TU, where TU = TMUX + TDE; annotated delays TBUF + TU and TBUF + TU + 0.5 x TDE; total duration TWAVE]

[2]
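The wave-front-train idea can be illustrated with a purely behavioral sketch in which one list shift stands for one unit delay TU. It assumes, following the timing diagram, that the pilot leads the data on the wire and that the deserializer stops once the pilot reaches the end of its chain; analog effects such as delay mismatch are not modeled, and the function names are hypothetical.

```python
def waft_serialize(data_bits):
    """Wave train on the wire: a leading pilot '1' followed by D0, D1, ..."""
    return [1] + list(data_bits)

def waft_deserialize(wire_bits, n):
    """Shift the wave train through n + 1 taps and stop when the pilot reaches the last tap."""
    taps = [0] * (n + 1)              # chain is cleared to "0" before a transfer
    for bit in wire_bits:
        taps = [bit] + taps[:-1]      # every wave advances by one delay element
        if taps[-1] == 1:             # pilot arrived at the end: assert STOP
            return list(reversed(taps[:-1]))  # taps hold D(n-1)..D0; return D0..D(n-1)
    raise ValueError("pilot never reached the end of the chain")

word = [1, 0, 1, 1]                   # D0..D3
wire = waft_serialize(word)
print(wire, waft_deserialize(wire, len(word)))
```

No serial clock appears anywhere in the model: the only timing reference is the pilot, which mirrors the claim that CLKSER is not required.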


WAFT Deserializer

MUX vs. latch

[Figure: 2:1 MUX (inputs A and B, output C, select SEL/SELb) drawn as equivalent to a latch between A and C]

[2]


Advantages of WAFT

High performance
  Achieves up to 4.3Gb/s @ 1.8V, 0.18μm CMOS technology
  Conventional serdes: max. 2Gb/s under the same conditions
Low power consumption
  Power-consuming D-FFs are eliminated
  47% less power consumption @ 2Gb/s operation
Low system overhead
  CLKSER is not required → no burden on the clock generation circuit

[2]


VDD Sensitivity of WAFT

Effects of supply voltage (VDD)
  Jitter at deserializer data if TU(ser) ≠ TU(des)

[Figure: transmitter and receiver at different supply levels (VDD1, VDD2), shown at T=0, T=1, T=2, ..., T=n; the wave spacing TD is marked at both ends]

[2]


VDD Sensitivity of WAFT (Cont’d)

Effects of supply voltage (VDD) (Cont’d)
  V(w0,t) = V(w1,t) = … = V(wn,t) → no jitter

[Figure: transmitter and receiver under a uniform VDD, shown at T=0, T=1, ..., T=n-1, T=n; the wave spacing is labeled TD, TD', TD'']

[2]


Solution for VDD Variation

VDD-insensitive delay
  Current-starving logic
    Constant VGS guarantees …
    Low VGS: stable but slow
    High VGS: unstable but fast

[Figure: current-starving inverter with bias inputs VREFP and VREFN (VGS)]
[Plot: VGS vs. delay time: TWAVE (ns, 1.2 to 2.4) vs. VGS (0.55 to 0.85) for VDD = 1.62V, 1.80V, 1.98V; annotations: low VGS region, high VGS region, ΔVGS for zero jitter, jitter at +10% ΔVDD]

[2]


Adaptive VGS Control

VDD-dependent VREF generation
  IM4 = IM1 - IM2 - IM3

[Figure: VREF generator with branch currents IM1, IM2, IM3, IM4]
[Plot: current profile according to VDD variation: I (mA, 0 to 160) vs. VDD (0.5 to 2.0 V) for IM1, IM2, IM3; marked range VDD = 1.62 to 1.98V]
[Plot: ideal VGS vs. generated VREF: VREFN (mV, 600 to 880) vs. VDD (1.6 to 2.0 V); curves: ideal VGS for zero jitter, VREFN curve of the reference generator]

[2]


Link Energy-Consumption Reduction

Adaptive bandwidth control
  Motivation
    Max. BW = 0.35 / TR
    TR depends on the supply voltage
    When less bandwidth is required, the supply voltage of a link can be reduced to save energy
  Implementation issues
    The operating frequency of the serializer has to change with the supply voltage
  Using a WAFT serdes, the output signal bandwidth changes automatically according to VDD

[Plot: BW (Gb/s, 1.0 to 3.0) vs. supply voltage (V); curves BMAX (max. affordable BW) and BOUT (output signal BW); data rate (ser & link) and data rate (switch output) marked at 0.8Gb/s and 1.6Gb/s]

[3]
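The bandwidth bound Max. BW = 0.35 / TR is easy to evaluate for a given rise time. The rise times in this sketch are hypothetical and are not measurements from the presentation.

```python
def max_bandwidth(t_rise):
    """Max. BW = 0.35 / TR, with t_rise in seconds and the result in Hz."""
    return 0.35 / t_rise

# Hypothetical rise times; a lower supply voltage typically slows the edges.
for label, t_rise in (("nominal VDD", 150e-12), ("reduced VDD", 300e-12)):
    print(f"{label}: TR = {t_rise * 1e12:.0f} ps -> max BW = {max_bandwidth(t_rise) / 1e9:.2f} GHz")
```

Doubling the rise time halves the affordable bandwidth, which is why a link that only needs a lower data rate can run from a lower supply.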


Link Energy-Consumption Reduction (Cont’d)

Energy consumption
  ELV = 0.36 ETYP
  EHV ~ ETYP
  EL2H = 1.42 ETYP
  EH2L ~ ETYP
  If N(phit) ≥ 4 @ LV, the transition overhead is compensated

[Figure: dual-VDD driving buffer (VDD, LVDD, LVEN, LVENb, IN, OUT)]
[Figure: energy-consumption graph per phit, with levels ETYP, ELV, EHV, EL2H, EH2L]

[3]


Summary

Basic architecture definition
  Topology
    ▪ Star, Mesh
    ▪ N, L
  Phit size
    ▪ OCS
    ▪ Optimal RSER
  Packet
    ▪ Aligned format
  Protocol
    ▪ Packet ordering
Advanced featuring
  Latency
    ▪ Cached Arbitration
  Energy-efficient circuits
    ▪ WAFT serdes
    ▪ Programmable sync
    ▪ Adaptive BW control


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
  1st chip implementation & measurement results
  2nd chip implementation & measurement results
Conclusion


1st Chip Implementation

Modeling an SoC environment of non-synchronized heterogeneous PUs
  Each clock source can be stimulated independently by external equipment
  The components are placed as far apart from each other as possible
  Clock distribution is done manually, without consideration of the clock skew problem

  Tech.       0.38μm CMOS technology
  Die size    10.8 x 6.0 mm2
  Tr. count   81,000 (excluding 1kB SRAM)
  Power       264mW @ 800MHz, 2.3V


Measurement Results

Single chip operation

[Figure: measured waveforms DATASTB, CLKNET (300MHz), CLKSRAM (75MHz), OENSRAM showing 3 packets and 3 SRAM data words, with a 12-cycle interval marked; block diagram with NI, SW, UPS1/DNS1, UPS2/DNS2, packet SRAM, and CLKPU1 (75MHz)]

Chip-to-chip operation

[Figure: measured waveforms CLKOGW, OENOGW, OENSRAM, TKN_PASS showing a chip A-to-B packet transaction, an SRAM read, OGWB grabbing the token, and a chip B-to-A packet transaction; block diagram of chip A and chip B, each with PLL and SW, connected through OGWA and OGWB over a bus with TOKEN, UPS/DNS, NI, SRAM, and CLKREF]


2nd Chip Implementation

Overall architecture
  Two-level hierarchical star topology (1 level-2, 4 level-1 switches)
  3 masters (traffic generators), 1 SRAM (256b)
  400MHz operation
Implemented features
  Burst-packet bypassing
  WAFT serdes
  Programmable delay SYNC
  Adaptive bandwidth control

  Tech.        0.18μm CMOS technology
  Die size     4.0 x 4.0 mm2
  Tr. count    409k
  Max. freq.   400MHz (1.6Gb/s) @ 1.8V


Switch Layout

[Layout: switch, 1200um x 1000um, with Arbiter, Switch Fabric, Inport 1 to 5, Outport 1 to 5, SER1, SER2, DES1, DES2]

Switch area breakdown

  Block name       Size                  Portion
  Switch fabric*   500 x 530 um2         31%
  Arbiters         400 x 170 um2         8%
  Input FIFO       320 x 136 um2 (x5)    25%
  Output port      180 x 140 um2 (x5)    15%
  Serializer       320 x 130 um2 (x2)    10%
  Deserializer     180 x 280 um2 (x2)    12%

* 1 port = 44b (2:1 serialized)


Measurement Results

8:1 WAFT serialized output

[Oscilloscope captures: serialized bit streams labeled 0101010111010101 and 10101010 with pilot markers; data words “11111110”, “11111111”, “00000000”]


Measurement Results (Cont’d)

Eye diagram

[Oscilloscope capture: pilot signal followed by D0 to D7, shown against the clock; 250ps marker]


Measurement Results (Cont’d)

Deserializer jitter

                                No noise   Noise
  Typical WAFT                  16.91ps    33.00ps
  WAFT with low-jitter scheme   17.07ps    23.70ps


Measurement Results (Cont’d)

Adaptive bandwidth control

[Oscilloscope capture: SOUT and CLK for a typical WAFT and with low-bandwidth mode enabled]


Measurement Results (Cont’d)

Programmable delay synchronizer
  Problem statement

[Oscilloscope capture: CLK and EN in mode A, mode B, and mode C; annotations: "Unstable", "1-clock delayed fetch"]


Measurement Results (Cont’d)

Programmable delay synchronizer
  Problem statement (cont’d)

[Oscilloscope capture: CLK and EN with intervals TA and TB; annotation: "1-clock delay fetch"]


Measurement Results (Cont’d)

Programmable delay synchronizer
  With programmable delay synchronizer

[Oscilloscope capture: CLK and EN in mode C without and with the programmable synchronizer; annotation: "Extension"]


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions


Conclusions

A cost-effective OCN architecture is defined
  Star vs. mesh comparison
  Packet format and protocols that consider on-chip SoC situations
Cost optimization is performed
  OCS for area/energy reduction
Advanced featuring is performed
  CAS for latency reduction
  WAFT for low-power/high-performance serdes
  Efficient synchronizer using programmable delay
  Adaptive bandwidth control for link energy-consumption reduction
OCN chips are implemented

A real on-chip network is realized