
Cost-Optimization and Chip Implementation of On-Chip Network

이세중 (Se-Joong Lee)
2005. 5. 17
Semiconductor System Lab, Dept. of EE, KAIST

Doctoral Thesis Presentation


Outline

Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions


Introduction

System-on-Chip Design
  Complexity of interconnect
    Interconnects more than 10 processors
    Requires a few GB/s on-chip traffic
    Communication channel is bottleneck
  Uncertain PHY parameters
    < 100nm era
    Prone to process variations
  And, other issues
    Clock synchronization @ ~ GHz

On-Chip Network

[Figures: CELL processor; device cross-section labelled CG, FG, W, and oxide spacer, with 200 nm and 60nm dimensions]


Introduction (Cont’d)

OCN concept

Basic architecture definition

Architecture optimization

Advanced featuring

Chip implementation

Previous works

Contribution of this work


Background

Research Issues on OCN

[Figure: OCN components and research issues. Components: processing unit (node), switch, queuing buffer, packet. Issues: network topology, protocol / packet format, routing scheme, switching scheme, flow control]


Background (Cont’d)

Active working groups & results
  Stanford – Bologna
    Netchip project = SUNMAP + XpipesCompiler
    Topology configuration, automated SystemC model synthesis
  Royal Institute of Technology
    Nostrum project
    Protocol, application-to-NoC mapping
  Philips
    AEthereal project
    QoS-guaranteed OCN
  Princeton
    Power modeling and simulation
Limitations of the previous works
  → Lack of consideration of implementation issues


Background (Cont’d)

Chip implementation cases
  Pleiades
    UC Berkeley, ISSCC2000
    Statically configured mesh-like interconnect
    No concept of network
  Programmable Switch
    ST Microelectronics, ISSCC2003
    Programmable switch using Flash memory structure
  RAW
    MIT, ISSCC2003
    Mesh network for 16 processors
    No dedicated research on the network architecture
[Figures: MIT's RAW; Berkeley's Pleiades; STM's On-Chip Switch]


Research Topics in This Work

1. Network topology
2. Channel width
3. Packet format/protocol
4. Latency
5. Serdes
6. Low-power link
7. Mesochronous communication


Introduction
Background
Cost-Optimized OCN Architecture
  [1] Topology Selection
  [2] On-Chip Serialization
  [3] Packet format/protocol
  [4] Latency Reduction
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusion


Topology Selection

General ideas on topology selection
  Traffic pattern – Topology
    Uniformly distributed traffic → Mesh-based topology
    Localized traffic → Hierarchical topology (ex. Tree, Hier.-Mesh)
  [Figures: Tree; Hierarchical Star]
Topology selection in a level
  Mesh vs. Star according to
    Link length
    # of nodes
[Figure: Hierarchical Star topology: clusters, each with a local network (Mesh or Star?), joined by a global network]

[1]


Topology Cost Comparison

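Area equations (approximated)

$A_M = (25N - 36\sqrt{N}) \cdot A_{SW} + (5N - 4\sqrt{N}) \cdot A_{QB} + 4(N - \sqrt{N}) \cdot A_{LK}$   (Mesh area)

$A_S = N^2 \cdot A_{SW} + N \cdot A_{QB} + N\sqrt{N} \cdot A_{LK}$   (Star area)

Term by term, with $A_{S\_SW}$, $A_{S\_QB}$, $A_{S\_LK}$ the switch, queuing-buffer, and link terms of $A_S$, and $A_{M\_SW}$, $A_{M\_QB}$, $A_{M\_LK}$ those of $A_M$:

$N \le 16 \;\rightarrow\; A_{S\_SW} \le A_{M\_SW}$

$N \le 16 \text{ and } A_{LK} \le 3 A_{QB} \;\rightarrow\; (A_{S\_LK} + A_{S\_QB}) \le (A_{M\_LK} + A_{M\_QB})$

Rules of thumb

$N \le 16 \text{ and } A_{LK} \le 3 \cdot A_{QB} \;\rightarrow\; A_S \le A_M$

$N \le 16 \text{ and } E_{LK} \le 2 \cdot E_{QB} \;\rightarrow\; E_S \le E_M$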

[1]
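The approximated area equations can be checked numerically. The sketch below assumes they read A_M = (25N - 36√N)·A_SW + (5N - 4√N)·A_QB + 4(N - √N)·A_LK for the mesh and A_S = N²·A_SW + N·A_QB + N√N·A_LK for the star; the star link term in particular is an assumption recovered from the stated N ≤ 16, A_LK ≤ 3·A_QB rule. The function names and the unit areas passed in are illustrative only.

```python
import math

def mesh_terms(n):
    """Unit-area multipliers for a sqrt(n) x sqrt(n) mesh (assumed form)."""
    r = math.isqrt(n)
    assert r * r == n, "n must be a perfect square"
    return {"SW": 25 * n - 36 * r, "QB": 5 * n - 4 * r, "LK": 4 * (n - r)}

def star_terms(n):
    """Unit-area multipliers for an n-node star (assumed form)."""
    r = math.isqrt(n)
    assert r * r == n, "n must be a perfect square"
    return {"SW": n * n, "QB": n, "LK": n * r}

def area(terms, a_sw, a_qb, a_lk):
    """Total area for given unit areas of switch, queuing buffer, and link."""
    return terms["SW"] * a_sw + terms["QB"] * a_qb + terms["LK"] * a_lk

if __name__ == "__main__":
    # Illustrative unit areas at the edge of the rule of thumb: A_LK = 3 * A_QB.
    for n in (4, 9, 16, 25):
        m, s = mesh_terms(n), star_terms(n)
        print(n, s["SW"] <= m["SW"], area(s, 1.0, 1.0, 3.0) <= area(m, 1.0, 1.0, 3.0))
```

Under these assumptions the star is no larger than the mesh up to N = 16 and becomes larger at N = 25, which is the behaviour the rules of thumb describe.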


Topology Cost Comparison (Cont’d)

Numerical analysis
  Based on actual implementation results
[Plots: area ratio AS / AM and energy ratio ES / EM versus N (4, 9, 16, 25), with curves for L = 0.5mm, 1.0mm, 1.5mm, 2.0mm]

[1]


Phit Size Determination

Wide vs. narrow phit size
  Wide phit size
    Realizes high bandwidth easily
    Not friendly in terms of P&R
    Possible high energy consumption due to large building blocks
  Small phit size (serialized link)
    Realizes small footprint of an OCN
    BW limitation
    Possible high energy consumption due to serdes and high clock frequency
No previous works on phit size determination...
Optimal phit size determination is required


On-Chip Serialization (OCS)

Effect on links
  Link area (ALK) is reduced
  ELK = EWIR + EDRV
    Wider space → Coupling cap. reduction → EWIR ↓
    Higher frequency → Smaller tapering factor → EDRV ↑
    EWIR > EDRV, therefore, ELK ↓
[Figure: OCS; plot: energy ratio versus RSER (1, 2, 4, 8)]

[2]


On-Chip Serialization (OCS) (Cont’d)

Effects on links
  Switching activity
    No encoding: up to x3 for linearly increasing data pattern
    Encoded serial link: <10% overhead
[2]
[Chart: parallel/serial bus transition for memory accesses of 3D operations; categories INST_ADDR, INST_DATA, DATA_ADDR, DATA_DATA; series: parallel bus, serial bus, encoded serial]



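Effects on switches
  ESF + EARB
    Arbiter operates on a packet basis → OCS does not affect it
    Switch fabric area/energy is reduced
[Figure: switch fabric with phit width W (before OCS) and W/RSER (after OCS); capacitances CJ = ΣCJi and CC = ΣCCi]

Without OCS: $E_{SF} = W \times (C_C + C_J)$

With OCS: $E_{SF} = R_{SER} \times \frac{W}{R_{SER}} \times \left( \frac{C_C}{R_{SER}} + C_J \right) = W \times \left( \frac{C_C}{R_{SER}} + C_J \right)$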

[2]
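A small numeric sketch of the switch-fabric scaling, assuming the serialized form E_SF = W × (C_C / R_SER + C_J). The normalized split C_C = 0.92, C_J = 0.08 is not from the slides; it is a hypothetical fit chosen so that the ratios come close to the SF row of the energy transfer matrix in this deck (1.00, 0.54, 0.32, 0.20).

```python
def sf_energy_ratio(r_ser, c_c=0.92, c_j=0.08):
    """E_SF(R_SER) / E_SF(1), assuming E_SF = W * (C_C / R_SER + C_J).

    c_c and c_j are hypothetical normalized capacitances (c_c + c_j = 1).
    """
    return (c_c / r_ser + c_j) / (c_c + c_j)

if __name__ == "__main__":
    for r in (1, 2, 4, 8):
        print(r, round(sf_energy_ratio(r), 3))
```

Only the C_C part shrinks with serialization, so the ratio flattens toward C_J / (C_C + C_J) as R_SER grows.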



Effects on queuing buffers
  Bitline capacitance in an SRAM-type QB
[Figure: dual-port SRAM queuing buffer (PRE, WD, WLD, BLSA, ADDR GEN, cells) organized as 88 x 8 for RSER=1, 44 x 16 for RSER=2, and 22 x 32 for RSER=4, 8, 16]
[Plot: energy ratio versus RSER (1, 2, 4, 8)]

[2]


Transfer & Characteristic Matrix

Transfer matrix
  Transformation representing OCS operation
Characteristic matrix
  Matrix representing area/energy breakdown of an OCN

TME =
         1:1    2:1    4:1    8:1
Link     1.00   0.75   0.64   0.80
QB       1.00   1.13   1.30   1.60
SF       1.00   0.54   0.32   0.20
Other    1.00   1.00   1.00   1.00

TMA =
         1:1    2:1    4:1    8:1
Link     1.00   0.75   0.5    0.31
QB       1.00   1.00   1.00   1.00
SF       1.00   0.25   0.06   0.02
Other    1.00   1.00   1.00   1.00

Characteristic matrices CMME, CMSE, CMMA, CMSA (four matrices, in slide order):

         Link    QB     SF     Other
N= 4     1.60L   0.33   0.26   0.01
N= 9     2.13L   0.33   0.58   0.01
N=16     3.20L   0.33   1.03   0.01
N=25     3.84L   0.33   1.61   0.01

         Link    QB     SF     Other
N= 4     1.60L   0.99   0.60   0.03
N= 9     2.13L   1.21   0.92   0.03
N=16     2.40L   1.32   1.09   0.03
N=25     2.56L   1.39   1.20   0.04

         Link    QB     SF     Other
N= 4     0.72L   0.27   0.33   0.02
N= 9     1.27L   0.27   0.75   0.04
N=16     1.79L   0.27   1.33   0.06
N=25     2.29L   0.27   2.08   0.10

         Link    QB     SF     Other
N= 4     0.48L   0.32   0.50   0.04
N= 9     0.85L   0.56   0.87   0.06
N=16     1.19L   0.77   1.20   0.07
N=25     1.53L   0.97   1.52   0.08

$CM' = CM \times TM$

[2]
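The product CM' = CM × TM can be evaluated directly to see which serialization ratio minimizes the total. The sketch below copies the energy transfer matrix TME and the first characteristic matrix listed on the slide; treating that matrix as an energy breakdown and setting L = 1 (1mm between PUs) are assumptions, and the names `TM_E`, `CM`, `totals` and `best_r_ser` are illustrative.

```python
R_SER = [1, 2, 4, 8]

# Energy transfer matrix TME: scale factor per component for each R_SER.
TM_E = {
    "Link":  [1.00, 0.75, 0.64, 0.80],
    "QB":    [1.00, 1.13, 1.30, 1.60],
    "SF":    [1.00, 0.54, 0.32, 0.20],
    "Other": [1.00, 1.00, 1.00, 1.00],
}

L_MM = 1.0  # assumed link length factor L

# First characteristic matrix on the slide (assumed to be an energy breakdown).
CM = {
    4:  {"Link": 1.60 * L_MM, "QB": 0.33, "SF": 0.26, "Other": 0.01},
    9:  {"Link": 2.13 * L_MM, "QB": 0.33, "SF": 0.58, "Other": 0.01},
    16: {"Link": 3.20 * L_MM, "QB": 0.33, "SF": 1.03, "Other": 0.01},
    25: {"Link": 3.84 * L_MM, "QB": 0.33, "SF": 1.61, "Other": 0.01},
}

def totals(cm_row, tm):
    """One row of CM x TM: total cost for each serialization ratio."""
    return [sum(cm_row[c] * tm[c][j] for c in tm) for j in range(len(R_SER))]

def best_r_ser(cm_row, tm):
    """Serialization ratio with the smallest total cost."""
    t = totals(cm_row, tm)
    return R_SER[t.index(min(t))]

if __name__ == "__main__":
    for n, row in CM.items():
        print(n, [round(x, 2) for x in totals(row, TM_E)], best_r_ser(row, TM_E))
```

With these inputs the minimum falls at 4:1 for every N, in line with the optimal RSER reported in the deck.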


Optimal RSER Determination

[Plots: breakdown into Link, QB, SF, Others versus RSER (1, 2, 4, 8) for N = 4, 9, 16, 25. Panels: Mesh energy EM and Star energy ES (x10^-10 J); Mesh area AM and Star area AS (x10^5 μm2)]

Analyze AM, AS, EM, ES according to
  N (4~25)
  1:1 ~ 8:1 serialization
Default phit size = 88b
Distance b/w PUs = 1mm
QB = 3-packet capacity

Optimal RSER = 4:1 OCS (or 22b phit size)

[2]


OCS-applied Topology Comparison

[Plots: AS / AM and ES / EM versus N (4, 9, 16, 25) for L = 0.5mm, 1.0mm, 1.5mm, 2.0mm, shown before and after applying OCS; annotations: Area Reduction, Energy Reduction]

[2]


Packet Format & Protocol

Typical vs. proposed packet format
  Typical format
    Layered architecture
    Phit, flit, and packet are defined independently
  Proposed format
    Aligned architecture
    Phit and packet formats are tightly correlated
      → Low decoding overhead (50% power reduction)
      → Easy to scale sub-fields of a packet

[Figure: typical format: a packet is divided into Flit 1, Flit 2, Flit 3, ..., Flit n, each with a flit header]

[Figure: proposed format: 88b packet and its mapping onto the link]
Packet field    Width    Link bits      Bits per phit
SW header       16b      link<3:0>      4b
NI header       4b       link<4>        1b
PL0             4b       link<5>        1b
PL1             32b      link<13:6>     8b
PL2             32b      link<21:14>    8b

[3]
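The aligned format can be sketched as a packing routine: the 88b packet (SW header 16b, NI header 4b, PL0 4b, PL1 32b, PL2 32b) is sent as four 22b phits, and each phit carries a quarter of every field at fixed link positions (link<3:0>, link<4>, link<5>, link<13:6>, link<21:14>). Which quarter of a field travels in which phit is not stated on the slide; the low-bits-first order used here, and all function names, are assumptions.

```python
# (field name, width in the 88b packet); order follows the link bit positions.
FIELDS = [("sw_header", 16), ("ni_header", 4), ("pl0", 4), ("pl1", 32), ("pl2", 32)]
R_SER = 4  # 4:1 serialization -> 22b phit

def to_phits(packet):
    """Split a dict of field values into R_SER aligned phits (low slices first)."""
    phits = []
    for k in range(R_SER):
        phit, shift = 0, 0
        for name, width in FIELDS:
            w = width // R_SER                      # bits of this field per phit
            piece = (packet[name] >> (k * w)) & ((1 << w) - 1)
            phit |= piece << shift                  # fixed position on the link
            shift += w
        phits.append(phit)
    return phits

def from_phits(phits):
    """Rebuild the packet fields; each field is read from the same link bits in every phit."""
    packet = {name: 0 for name, _ in FIELDS}
    for k, phit in enumerate(phits):
        shift = 0
        for name, width in FIELDS:
            w = width // R_SER
            packet[name] |= ((phit >> shift) & ((1 << w) - 1)) << (k * w)
            shift += w
    return packet

if __name__ == "__main__":
    pkt = {"sw_header": 0xBEEF, "ni_header": 0x9, "pl0": 0x5,
           "pl1": 0x12345678, "pl2": 0xCAFEF00D}
    print([hex(p) for p in to_phits(pkt)])
```

Because a field always sits on the same link bits, a receiver can pick out, for example, the SW header slice from link<3:0> of every phit without first reassembling the packet.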


Packet Format & Protocol (Cont’d)

Packet ordering
  Problem statement
    Packet delivery time depends on network condition
    When multiple packets are issued, the packet sequence could be disordered
  Typical solution
    Packet sequence number ← Requires large queue
  Proposed solution
    Uses fixed routing
    NI controls read-packets
[Figure: an NI connected through switches (SW) to NIs A, B, and C, with packets C, B, B, B in flight; note: hold 'A' until all the packet transactions with 'B' complete]

[3]
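The NI-side control can be illustrated with a small model. This is a sketch under assumptions, not the thesis implementation: it assumes the NI keeps read packets outstanding to only one target at a time and holds requests to any other target until the current one has drained. The class and method names are hypothetical.

```python
class ReadOrderingNI:
    """Toy NI that keeps outstanding read packets to a single target at a time."""

    def __init__(self):
        self.target = None     # target that currently has outstanding reads
        self.outstanding = 0   # number of reads not yet answered
        self.held = []         # requests waiting for the current target to drain

    def request(self, dest):
        """Try to issue a read to dest; return True if issued, False if held."""
        if self.target is None or self.target == dest:
            self.target = dest
            self.outstanding += 1
            return True
        self.held.append(dest)
        return False

    def response(self):
        """Account for one returned read; return the held requests issued as a result."""
        self.outstanding -= 1
        issued = []
        if self.outstanding == 0:
            self.target = None
            pending, self.held = self.held, []
            for dest in pending:
                if self.request(dest):
                    issued.append(dest)
        return issued

if __name__ == "__main__":
    ni = ReadOrderingNI()
    print(ni.request("B"), ni.request("B"), ni.request("A"))  # 'A' is held
    print(ni.response(), ni.response())                        # 'A' issues after 'B' drains
```

With fixed routing, responses from a single target return in order, so no sequence numbers or reorder queue are needed in this model.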


Latency Reduction

Motivation
  OCN = Set of switches and QBs
    Provides sufficient bandwidth, but suffers from large end-to-end latency
  Propagation time vs. switching time
    Switching time takes the higher portion
[Figure: two switches (SW) with Tswitching and the link delay Tprop]
[Plot: delay time (psec) versus year (2002 to 2011) for Tswitching and Tprop]

[4]

Based on ITRS roadmap data


Latency Reduction (Cont’d)

Previous work (AEthereal*)

Contention-free routing
  Eliminates queuing and arbitration delay
  Switches use slot table
  Guarantees min. bandwidth and max. latency
Limitations
  Initial packet transmission requires waiting-time for the time-slot
  Requires latching at every switch
  Hard to apply to a mesochronous OCN
[Figure: packet P from A to B through SW 1, SW 2, SW 3, with slots 1, 2, 3]

[4]

*E. Rijpkema, et al., “Trade-offs in the Design of A router…”, DATE2003


Proposed Scheme

Concept
  Stores information of a frequently used route
  Storing packet route using actual connection!
[Figure: route from A to B through SW 1, SW 2, SW 3, with an additional line]

[4]

Cached Arbitration


Cached-Arbitration Scheme

Operations
  AR-path connection / disconnection
  PR-path reservation / release
Design issue
  Synchronization issues on AR-path connection/disc. & PR-path reservation/release
Design overhead
  4b AR-link, ARARB, ARSW
AR-path setup time
  2 x Tprop + Tcycle
  cf) Packet-based circuit path setup time = 2 x Tprop + N x Tcycle
[Figure: NIs A and B connected through switches (SW) by an AR-link (AR-path) and a PR-link (PR-path); labels: ARR, ACK, DISC, ARSP, ARSF, ARARB, FTR]

[4]
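The two setup-time expressions can be compared with a short calculation. The values below are illustrative only: Tcycle = 2.5 ns corresponds to a 400MHz network clock, Tprop = 0.5 ns is made up, and reading N as the number of switches on the path is an assumption.

```python
def ar_path_setup(t_prop, t_cycle):
    """AR-path setup time: 2 x Tprop + Tcycle."""
    return 2 * t_prop + t_cycle

def packet_based_setup(t_prop, t_cycle, n):
    """Packet-based circuit path setup time: 2 x Tprop + N x Tcycle."""
    return 2 * t_prop + n * t_cycle

if __name__ == "__main__":
    t_prop, t_cycle = 0.5, 2.5  # ns, illustrative values
    for n in (1, 2, 3):
        print(n, ar_path_setup(t_prop, t_cycle), packet_based_setup(t_prop, t_cycle, n))
```

The AR-path figure does not grow with N, so the gap widens by one Tcycle for each additional unit of N.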


Summary

Basic architecture definition
  Topology: Star, Mesh; N, L
  Phit size: OCS; optimal RSER
  Packet: aligned format
  Protocol: packet ordering
Advanced featuring
  Latency: Cached Arbitration
Circuit-level
  Low-power/high-perf. schemes


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
  [1] Mesochronous Communication
  [2] Wave-Front-Train: A New Serdes Scheme
  [3] Adaptive Bandwidth Control
Chip Implementations & Measurement Results
Conclusion


Mesochronous Communication

Mesochronous
  fCLK (transmitter) = fCLK (receiver)
  φ (transmitter) ≠ φ (receiver)
Issues in mesochronous signaling
  Unknown latch timing
  Possible failure due to metastability
Conventional solutions
  Pipeline synchronizer
  FIFO synchronizer
[Figures: timing of input signal, clk, and latched signal; pipeline synchronizer (two D flip-flops in series); FIFO synchronizer]

[1]


Programmable Delay Synchronizer

Motivation
  On-chip situations causing phase variation
    Packet-bypassing technique (physical distance variation)
    Supply voltage control (electrical distance variation)
    Network clock frequency variation (reference timing variation)
  "Phase variation" is unknown but quantized
My approach
  Programmable delay
    Memorize appropriate delay for a given network-mode
    Recall the delay value according to the network-mode
[Figure: block diagram: input signal, V D block, phase detector (phase information), registers (network mode), synchronized output]

[1]


Programmable Delay Synchronizer (Cont’d)

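Calibration
  Input signal sampling @ 0, TD, 2TD
[Figure: main clock and delayed clocks (TD, 2TD) sampling the input signal; regions labelled Void, V, D]
[Figure: PDU with delay elements (D), registers (Program, Enable, Mode), encoder outputs D<0> to D<3>, and DONE signal to the switch]
[Figure: CMU, NI/PU, and per-bit SYNC blocks on Packet<0>, Packet<1>, Packet<2>, Packet<3>, ... with CLK at the switch]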

[1]
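The memorize-and-recall idea can be sketched as a small table keyed by network mode. The selection rule used in `calibrate` (choose the sampling clock among 0, TD, 2TD that is farthest from the data transition) is an assumption made for illustration; the slides only state that the input is sampled at those three points. All names and timing values are hypothetical.

```python
def calibrate(transition_phase, t_clk, t_d):
    """Return delay code 0, 1 or 2 (delay = code * t_d) farthest from the data transition.

    transition_phase: where the data edge falls within the clock period (same unit as t_clk).
    """
    def margin(delay):
        d = abs(delay - transition_phase) % t_clk
        return min(d, t_clk - d)       # circular distance to the transition
    return max((0, 1, 2), key=lambda code: margin(code * t_d))

class ProgrammableDelaySync:
    """Stores one delay code per network mode and recalls it on a mode change."""

    def __init__(self):
        self.registers = {}            # network mode -> delay code

    def program(self, mode, transition_phase, t_clk, t_d):
        self.registers[mode] = calibrate(transition_phase, t_clk, t_d)

    def delay_for(self, mode):
        return self.registers[mode]

if __name__ == "__main__":
    sync = ProgrammableDelaySync()
    # Times in ps: 2500 ps clock period, 400 ps delay step, one transition phase per mode.
    for mode, phase in (("A", 50), ("B", 800), ("C", 2450)):
        sync.program(mode, phase, 2500, 400)
    print(sync.registers)
```

Because the phase variation is quantized by network mode, calibration is needed only once per mode; afterwards a mode switch is a register lookup.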


High-Speed Serdes

Conventional architecture
  Shift-register + MUX
Limitations
  Requires CLKSER (= RSER x CLK)
  Synchronization issue at a receiver
  Max. frequency is limited by TSETUP + TMUX + THOLD
[Figure: conventional serializer (D-FFs with 2:1 MUXes loading D3 to D0 under LOADb, clocked by CLK, output SOUT) and deserializer (D-FF chain with enables EN, outputs Q3 to Q0)]

[2]


Proposed Serdes Scheme

Wave-Front-Train (WAFT) serdes concept
  Data separation is done by delay elements, not D-FFs
[Figure: three snapshots of the delay-element chain carrying D0 to D3 together with the "1" pilot signal and idle "0", with outputs Q3 to Q0 and control signals ENb and STOP]

[2]


WAFT Serdes Operations

Schematic & timing diagrams
[Figure: serializer (MUXes and delay elements DE loading D0 to D3 and a VDD-tied pilot signal under EN; nodes QS3 to QS0, QP; MUXP, MUXO, BUF; output SOUT) and deserializer (MUX/DE chain with outputs Q3 to Q0, a /2 block, and STOP)]
[Timing diagram: EN, pilot, D3 to D0, SOUT, Q3 to Q0, STOP; marked intervals TMUX, TU, TBUF + TU, TBUF + TU + 0.5 x TDE, TWAVE]
( TU = TMUX + TDE )

[2]
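A behavioural sketch of the wave-front train, abstracting each delay element to one time step of TU. It assumes the pilot "1" leads the train and that the deserializer stops (STOP) when the pilot reaches the last stage of its chain; the transmission order of D0 to D3 and the function names are assumptions, and analog effects (TMUX, TDE, jitter) are not modelled.

```python
def serialize(bits):
    """Wave-front train on the wire: pilot '1' first, then the data bits, one per T_U."""
    return [1] + list(bits)

def deserialize(line, n):
    """Shift line symbols through an (n+1)-stage delay chain until the pilot reaches the end.

    line: symbols seen on the wire, one per T_U (idle is 0).
    Returns the n data bits in transmission order.
    """
    chain = [0] * (n + 1)                  # chain[0] is the input side
    for symbol in line:
        chain = [symbol] + chain[:-1]      # every stage delays by one T_U
        if chain[-1] == 1:                 # pilot at the last stage -> STOP
            return chain[n - 1::-1]        # earliest data bit sits next to the pilot
    raise ValueError("no pilot reached the end of the chain")

if __name__ == "__main__":
    data = [1, 0, 1, 1]
    wire = [0, 0] + serialize(data) + [0]  # idle zeros around the train
    print(deserialize(wire, len(data)))
```

No serial clock appears in the model: the only timing reference is the arrival of the pilot, which mirrors why CLKSER is not needed.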


WAFT Deserializer

MUX vs. Latch
[Figure: MUX (inputs A, B; select SEL, SELb; output C) shown as equivalent to a latch (input A, output C)]

[2]


Advantages of WAFT

High performance
  Achieves up to 4.3Gb/s @ 1.8V, 0.18μm CMOS tech.
  Conventional serdes: max. 2Gb/s @ same condition
Low power consumption
  Power-consuming D-FFs are eliminated
  47% less power consumption @ 2Gb/s operation
Low system overhead
  CLKSER is not required → No burden on the clock generation circuit

[2]


VDD Sensitivity of WAFT

Effects of supply voltage (VDD)
  Jitter at deserializer data if TU(ser) ≠ TU(des)
[Figure: transmitter and receiver delay lines (delay TD) under supplies VDD1 and VDD2, shown at T=0, T=1, T=2, ..., T=n]

[2]


VDD Sensitivity of WAFT (Cont’d)

Effects of supply voltage (VDD) (Cont’d)
  V(w0,t) = V(w1,t) = ... = V(wn,t) → No jitter
[Figure: transmitter and receiver under uniform VDD at T=0, T=1, ..., T=n-1, T=n, with delays TD, TD', TD'']

[2]


Solution for VDD Variation

VDD-insensitive delay
  Current-starving logic
    Constant VGS guarantees
      Low VGS: stable but slow
      High VGS: unstable but fast
[Figure: current-starving inverter biased by VREFP and VREFN (VGS)]
[Plot: current-starving inverter VGS vs. delay time: TWAVE (ns) versus VGS (0.55 to 0.85) for VDD = 1.62V, 1.80V, 1.98V; marked: low VGS region, high VGS region, ΔVGS for zero jitter, jitter at +10% ΔVDD]

[2]


Adaptive VGS Control

VDD-dependent VREF generation
[Figure: VREF generator with currents IM1, IM2, IM3, IM4]
  IM4 = IM1 - IM2 - IM3
[Plot: current profile according to VDD variation: I (mA) versus VDD (V) for IM1, IM2, IM3; range VDD = 1.62~1.98V marked]
[Plot: ideal VGS vs. generated VREF: VREFN (mV) versus VDD (V), comparing the ideal VGS for zero jitter with the VREFN curve of the ref. generator]

[2]


Link Energy-Consumption Reduction

Adaptive bandwidth control
  Motivation
    Max. BW = 0.35 / TR
    TR depends on supply voltage
    When less bandwidth is required, supply voltage of a link can be reduced to save energy consumption
  Implementation issues
    Operation frequency change of a serializer according to the supply voltage variation
  Using a WAFT serdes, output signal bandwidth changes automatically according to VDD.
[Plot: BW (Gb/s) versus supply voltage (V) for BMAX and BOUT; data rates of the serializer and link and of the switch output marked at 0.8Gb/s and 1.6Gb/s]
  BMAX = Max. affordable BW
  BOUT = Output signal BW

[3]


Link Energy-Consumption Reduction (Cont’d)

Energy consumption
[Figure: dual VDD driving buffer (supplies VDD and LVDD; enables LVEN and LVENb; IN, OUT)]
[Graph: energy consumption per phit]
  ETYP
  ELV = 0.36 ETYP
  EHV ~ ETYP
  EL2H = 1.42 ETYP
  EH2L ~ ETYP
If N(phit) ≥ 4 @ LV, transition overhead is compensated

[3]
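The N(phit) ≥ 4 threshold can be reproduced with a short calculation. The sketch assumes that a low-voltage burst of n phits costs EH2L + n x ELV + EL2H, that the same n phits cost n x ETYP otherwise, and that EH2L is exactly ETYP (the slide gives it only as approximately ETYP). The function names are illustrative.

```python
import math

E_TYP = 1.0                 # normalized typical energy per phit
E_LV = 0.36 * E_TYP         # per phit at low voltage
E_L2H = 1.42 * E_TYP        # low-to-high transition
E_H2L = 1.0 * E_TYP         # high-to-low transition, taken as exactly E_TYP (assumption)

def low_voltage_burst_energy(n_phit):
    """Energy of entering low-voltage mode, sending n_phit phits, and leaving it."""
    return E_H2L + n_phit * E_LV + E_L2H

def breakeven_phits():
    """Smallest burst length for which low-voltage mode does not cost more."""
    return math.ceil((E_H2L + E_L2H) / (E_TYP - E_LV))

if __name__ == "__main__":
    print(breakeven_phits())
    for n in (3, 4, 8):
        print(n, round(low_voltage_burst_energy(n), 2), n * E_TYP)
```

Under these assumptions the two transitions cost 2.42 ETYP and each low-voltage phit saves 0.64 ETYP, so the overhead is recovered from the fourth phit onward.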


Summary

Basic architecture definition
  Topology: Star, Mesh; N, L
  Phit size: OCS; optimal RSER
  Packet: aligned format
  Protocol: packet ordering
Advanced featuring
  Latency: Cached Arbitration
Energy-efficient circuits
  WAFT serdes
  Programmable sync
  Adaptive BW control


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
  1st chip implementation & measurement results
  2nd chip implementation & measurement results
Conclusion


1st Chip Implementation

Modeling SoC environment of non-synchronized heterogeneous PUs
  Each clock source can be stimulated independently by external equipment
  Every component is maximally spread apart from each other
  Clock distribution is done manually without consideration of the clock skew problem

Tech.        0.38μm CMOS technology
Die size     10.8 x 6.0 mm2
Tr. count    81,000 (excluding 1kB SRAM)
Power        264mW @ 800MHz, 2.3V


Measurement Results

Single chip operation
[Figure: block diagram with NI, SW (UPS1, UPS2, DNS1, DNS2), and packet SRAM; clocks CLKNET (300MHz), CLKSRAM (75MHz), CLKPU1 (75MHz). Waveforms: DATASTB, CLKNET, CLKSRAM, OENSRAM; 3 packets on DATASTB, 3 SRAM data on OENSRAM, 12 cycles marked]

Chip-to-chip operation
[Figure: Chip A and Chip B, each with PLL, SW (UPS, DNS), NI, SRAM, and OGW (OGWA, OGWB), connected by a bus with TOKEN and sharing CLKREF. Waveforms: CLKOGW, OENOGW, OENSRAM, TKN_PASS; sequence: Chip A-to-B packet transaction, SRAM read, OGWB grabs token, Chip B-to-A packet transaction]


2nd Chip Implementation

Overall architecture
  Two-level hierarchical star topology (1 level-2, 4 level-1 switches)
  3 masters (traffic generators), 1 SRAM (256b)
  400MHz operation
Implemented features
  Burst-packet bypassing
  WAFT serdes
  Programmable delay SYNC
  Adaptive bandwidth control

Tech.        0.18μm CMOS technology
Die size     4.0 x 4.0 mm2
Tr. count    409k
Max. Freq.   400MHz (1.6Gb/s) @ 1.8V


Switch Layout

[Layout: 1200um x 1000um switch containing Arbiter, Switch Fabric, Inport 1 to Inport 5, Outport 1 to Outport 5, SER1, SER2, DES1, DES2]

Switch area breakdown
Block name        Size                      Portion
Switch fabric*    500 x 530 um2             31%
Arbiters          400 x 170 um2             8%
Input FIFO        320 x 136 um2 x 5 ea      25%
Output port       180 x 140 um2 x 5 ea      15%
Serializer        320 x 130 um2 x 2 ea      10%
Deserializer      180 x 280 um2 x 2 ea      12%

* 1 port = 44b (2:1 serialized)


Measurement Results

8:1 WAFT serialized output
[Waveforms: serialized output with pilot bits marked, for data “11111110”, “11111111”, “00000000”, and alternating 1/0 patterns]


Measurement Results (Cont’d)

Eye diagram
[Eye diagram: pilot signal and D0 to D7 against the clock; 250ps marked]



Deserializer jitter
                               No noise    Noise
Typical WAFT                   16.91ps     33.00ps
WAFT with low-jitter scheme    17.07ps     23.70ps



Adaptive bandwidth control
[Waveforms: SOUT and CLK for typical WAFT and with low-bandwidth enabled]



Programmable delay synchronizer
  Problem statement
[Waveforms: CLK and EN in mode A, mode B, and mode C; labels: Unstable, 1-clock delayed fetch]
[Waveforms: CLK and EN with intervals TA and TB; label: 1-clock delay fetch]
  With programmable delay synchronizer
[Waveforms: CLK and EN in mode C without and with the programmable synchronizer; label: Extension]


Introduction
Background
Cost-Optimized OCN Architecture
Circuit Design of OCN
Chip Implementations & Measurement Results
Conclusions


Conclusions

Cost-effective OCN architecture is defined
  Star vs. Mesh comparison
  Packet format and protocols considering on-chip SoC situations
Cost-optimization is performed
  OCS for area/energy reduction
Advanced featuring is performed
  CAS for latency reduction
  WAFT for low-power/high-performance serdes
  Efficient synchronizer using programmable delay
  Adaptive bandwidth control for link energy-consumption reduction

OCN chips are implemented

Real On-Chip Network is realized