Design and Implementation of Low-Power Network-on-Chip for Application to High-Performance System-on-Chip Design

ADVISOR: Professor Yoo, Hoi-Jun

By Kangmin Lee

Department of Electrical Engineering and Computer Science

Division of Electrical Engineering

Korea Advanced Institute of Science and Technology

A THESIS SUBMITTED TO THE FACULTY OF THE KOREA

ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY IN

PARTIAL FULFILLMENT OF REQUIREMENTS OF THE DEGREE OF

DOCTOR OF PHILOSOPHY IN THE DEPARTMENT OF ELECTRICAL

ENGINEERING AND COMPUTER SCIENCE, DIVISION OF

ELECTRICAL ENGINEERING.

DAEJEON, KOREA

2005. 12. 11

APPROVED BY

__________________

Professor Yoo, Hoi-Jun

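Design and Implementation of Low-Power Network-on-Chip for High-Performance System-on-Chip

Lee, Kangmin (이강민)

This dissertation has been examined and approved by the thesis committee as a doctoral dissertation of the Korea Advanced Institute of Science and Technology.

December 2, 2005

Committee Chair: Yoo, Hoi-Jun (유회준) (seal)

Committee Member: Park, Kyu-Ho (박규호) (seal)

Committee Member: Kim, Jung-Ho (김정호) (seal)

Committee Member: Shin, Young-Soo (신영수) (seal)

Committee Member: Lee, Hyuk-Jae (이혁재) (seal)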

20025851

Lee, Kangmin (이강민). Design and implementation of Low-Power Network-on-Chip for Application to High-Performance System-on-Chip Design (Korean title: 고성능 시스템 온칩용 저전력 네트워크 온칩의 설계 및 구현). Department of Electrical Engineering and Computer Science, Division of Electrical Engineering. 2005. 152p. Advisor: Professor Yoo, Hoi-Jun. Text in English.

Abstract

A low-power packet-switched Network-on-Chip (NoC) with a hierarchical star topology is designed and implemented in silicon for application to high-performance SoCs. This dissertation shows how to achieve low power consumption in an NoC while covering the whole NoC design process, from the architecture decision to the system demonstration.

First, a performance- and cost-oriented topology exploration is performed. The evaluated topologies include not only flat topologies such as a bus, mesh, star and point-to-point but also sixteen hierarchical and heterogeneous topologies. The evaluation uses technology-independent analytical models with implementation-based physical parameters.

Second, the detailed network architecture, including the switching method, packet synchronization, link serialization, protocol and buffering schemes, is analyzed with special emphasis on low power consumption. The implemented chip contains two RISC processors for multiprocessor emulation, two 64kb SRAMs, an on-chip FPGA, an off-chip gateway for interfacing to an outer network, three 4kb SRAMs for peripheral logic emulation, a 1.6GHz PLL for internal clock generation, and on-chip networks connecting those processing units. The on-chip network channel is serialized from 80 bits onto 8 bits to reduce the area and complexity of the network. Source-synchronous signaling enables plesiochronous communication between processing units running at different clock frequencies. Low power consumption is achieved by adopting various techniques such as low-swing signaling on a global link, a Mux-Tree based round-robin scheduler in a router, crossbar partial activation, low-energy serial-link coding and clock frequency scaling. The chip consumes less than 160mW, and the on-chip network consumes less than 51mW while delivering 11.2GB/s of aggregate network bandwidth. The power consumption per unit bandwidth is a ninth of that of the previous study. The 5 x 5 mm2 chip is fabricated in a 0.18µm CMOS process, and a system evaluation board successfully demonstrates it on multimedia applications. Multiple NoCs are integrated in a single BGA package to form a Networks-in-Package (NiP) for large, scalable, low-cost systems.

Contents

Abstract

1. Introduction

2. Topology Exploration
2.1 Methodology Description
2.2 Energy Exploration
2.3 Area Exploration
2.4 Performance Exploration
2.5 Discussion
2.6 Model Verification with Case-Study Examples
2.7 Summary

3. Network Architecture
3.1 Circuit Switching and Packet Switching
3.2 Synchronization
3.3 Serialization
3.4 NoC Protocol

4. Low-Power Techniques
4.1 Low-swing signaling
4.2 Mux-Tree based Round-Robin Scheduler
4.3 Crossbar Partial Activation Technique
4.4 Low-Energy Coding on On-chip Serial Link

5. Implementation & Measurements
5.1 Design Flow and Methodology
5.2 Chip Implementation and Measurements
5.3 Network-in-Package
5.4 Demonstration System

6. Conclusions

Appendix: BONE-2 Protocol Specification

Summary

Bibliography

Acknowledgement

Chapter 1 Introduction

In the nanoelectronics era, System-on-Chip (SoC) design offers many opportunities and as many difficulties. More than a billion transistors are expected to be integrated on a single chip in this decade, together with numerous Intellectual Property (IP) blocks and multiprocessors. According to the International Technology Roadmap for Semiconductors, 50nm CMOS at 10GHz clock speeds will be readily available before the end of the decade [1]. With a chip area of about 400mm2, which is equal to the area of the 64b Itanium processor, over a thousand microprocessor cores (or modules of comparable complexity) may be integrated onto a single chip. For the implementation of these large-scale SoCs, one of the most challenging tasks is handling the ever-increasing power consumption. Another challenge is the complicated interconnection between the integrated devices.

[Figure: heterogeneous IP blocks (µP, DSP, graphics engine, FPGA, memories, peripheral IPs, PMU) with individual clock sources, connected through network interfaces and routers (switches).]

Fig. 1.1 Heterogeneous Network-on-Chip architecture

Wire delays have become critical compared with gate delays, causing synchronization problems among IPs. This trend worsens as clock frequencies increase and feature sizes decrease [1]. Moreover, the interconnections, including the clock wires spread over the whole chip area, are readily affected by the process uncertainty and physical disturbances of nanotechnology. The performance of SoCs will depend on the capability to efficiently interconnect multiple predefined and pre-verified IPs in accordance with their communication requirements [2]. Furthermore, communication among IPs consumes a significant portion of the overall system power budget.

Recently, Network-on-Chip (NoC) architectures have been emerging as a scalable, reliable, and highly modular on-chip communication infrastructure for SoC design [3]. The NoC architecture uses layered protocols and packet-switched networks which consist of on-chip routers (or switches), links, and network interfaces on a predefined topology, as depicted in Fig. 1.1. Thanks to this modular structure, the NoC methodology aims to design SoCs in a Plug-and-Play fashion akin to the conventional computer Internet. Instead of interconnecting chip modules at the top level using ad-hoc routing of dedicated global wires, as is done today, a better approach is to interconnect them by a structured on-chip packet-switching network that routes packets between them. The advantages of the NoC approach include both modularity and performance benefits. There have been many architectural and theoretical studies on NoCs, covering design methodology [3], [4], topology exploration [2], [5], QoS guarantee [6], reliable transmission [7], software issues [8] and test and verification [23]. However, only a few were implemented and verified in silicon, and those were not energy-efficient [9], [10]. For large-scale NoC implementations, the power consumed by the network infrastructure must be minimized to achieve reliable transmission at low cost.

Conventional low-power techniques include dynamic voltage/frequency scaling, power supply gating with sleep transistors, clock gating, bus coding and so on. The objective of this thesis is to extend such low-power design methodologies to the NoC design field and to verify them in NoC design environments.

In this study [11], I designed and implemented a hierarchically star-connected NoC with various low-power techniques. The chip contains heterogeneous IPs such as two RISC processors, multiple memory arrays, an FPGA, off-chip network interfaces, and a PLL. The integrated on-chip network provides 11.2GB/s of aggregate bandwidth and consumes 51mW under full traffic. By comparison, the previous work [9] consumes 264mW at 6.4GB/s of bandwidth. The ratio of power consumption to provided bandwidth is therefore about nine times lower in this study than in the previous work.
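The power-per-bandwidth comparison can be checked with a few lines of arithmetic. The sketch below uses only the four figures quoted in the text; the variable and function names are illustrative.

```python
# Power and bandwidth figures quoted in the text; [9] is the earlier chip.
THIS_POWER_MW, THIS_BW_GB_S = 51.0, 11.2
PREV_POWER_MW, PREV_BW_GB_S = 264.0, 6.4


def power_per_bandwidth(power_mw, bandwidth_gb_s):
    """Return mW spent per GB/s of delivered bandwidth."""
    return power_mw / bandwidth_gb_s


this_work = power_per_bandwidth(THIS_POWER_MW, THIS_BW_GB_S)
previous = power_per_bandwidth(PREV_POWER_MW, PREV_BW_GB_S)
improvement = previous / this_work

print(f"this work: {this_work:.2f} mW per GB/s")
print(f"previous:  {previous:.2f} mW per GB/s")
print(f"ratio:     {improvement:.2f}")
```

The ratio comes out slightly above nine, which matches the "a ninth" figure stated in the abstract.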

Large-scale SoCs with a huge chip size, such as embedded memory logic systems, often suffer from low yield and high cost. To cope with these problems, System-in-Package (SiP or System-on-Package) techniques have emerged. However, current SiP technology focuses on the fabrication process rather than on the methodology for interconnecting the partitioned chips. NoCs have such a modular structure that a large system can be divided into several parts or several chips to mitigate these problems. In this thesis, four NoCs are mounted in a single package to emulate a larger system, forming a Networks-in-Package (NiP), a new family of SiP, with the seamless NoC protocol.

The organization of this thesis is as follows. Multiple basic and hierarchical topologies are evaluated in terms of performance and cost in Chapter 2. The NoC architecture is discussed in Chapter 3, and the low-power techniques are presented in Chapter 4. The implementation and measurement results and the demonstration of the NoC and NiP follow in Chapter 5, and finally, the conclusions of this work are summarized in Chapter 6.

Chapter 2 Topology Exploration

Recently, state-of-the-art chips have been integrating multiple processing cores to scale up their performance rather than accelerating a single core with a faster clock [25]. This trend can be observed not only in high-end servers and desktop computers but also in diverse embedded applications such as entertainment devices and mobile terminals [26-27]. As the trend accelerates, the on-chip communication infrastructure between cores requires higher bandwidth and a more scalable architecture than a conventional shared-bus structure. To cope with these requirements, the concept of a Network-on-Chip (NoC) was proposed [28] and implemented [11] as a packet-switching interconnection network with a scalable topology. The NoC is well known to provide sufficient bandwidth and throughput by using non-blocking switching fabrics and packet-multiplexed channels. However, the power and area cost of the on-chip network have not been clearly examined in comparison with those of the conventional bus architecture. Moreover, the performance, power and area cost depend strongly on the network topology [29]. Therefore it is crucial to choose the optimal topology which meets the performance requirements and the energy and area budget.

There have been several studies on topology exploration for NoCs. Murali et al. developed a tool for automatically selecting an application-specific topology that minimizes average communication delay, area and power dissipation [5]. Wang et al. presented a technology-aware topology exploration of various meshes and tori [29]. Kreutz et al. presented a topology evaluation engine based on a heuristic optimization algorithm [30].

In these prior works, the candidate pool of topologies was limited to regular and homogeneous topologies like a mesh, torus, cube, tree or multistage network. Moreover, a comparison with the conventional bus architecture, which is the practical concern of field engineers, was not sufficiently studied. In heterogeneous SoCs like embedded or mobile systems, however, the communication flows are localized, not uniformly distributed [2-3]. In this case, it is highly possible that the optimal topology is a heterogeneous and hierarchical topology rather than a homogeneous and flat topology. For example, George et al. proposed a hybrid interconnect structure incorporating a local point-to-point, a semi-global mesh and a global tree network for a low-power FPGA application [31], which led to the Pleiades chip implementation [56]. Therefore such hierarchical and heterogeneous topologies need to be investigated in more detail.

In this study, I present an analytic methodology to predict the performance, energy and area cost of various topologies: not only basic topologies including bus, mesh, star and point-to-point topologies but also hierarchical and hybrid topologies, for example a hierarchical bus, local-star global-mesh or local-bus global-star topology. In this analysis, analytical models are proposed, and physical parameters depending on the process technology and circuit design are used. This work reveals the detailed relationship of the performance, energy and area characteristics with the number of integrated cores and various traffic patterns.

This chapter is organized as follows. Section 2.1 presents the candidate topologies to be explored and the exploration method, including the performance and cost models and a traffic model. In Sections 2.2, 2.3 and 2.4, the energy, area and performance properties of the topologies are examined, respectively. In Section 2.5, the most performance- and cost-efficient topology is discussed. In Section 2.6, the results from the proposed models are compared with real implementation results to validate the models. Finally, Section 2.7 summarizes the chapter.

2.1 Methodology Description

2.1.1 Topology Pool

Topologies are categorized into two groups in this analysis: flat topologies and hierarchical topologies. Fig. 2.1 shows flat topologies such as a bus, star, mesh and point-to-point, and also hierarchical topologies, for example local-bus global-star, local-star global-mesh and local-star global-star. A hierarchical topology consists of a local and a global network, each of which can be any of the basic topologies. I comparatively analyze the four flat topologies and the sixteen hierarchical topologies.

Fig. 2.1: Topology pool: (a) ~ (d) basic flat topologies and (e) ~ (g) hierarchical topologies as examples.


2.1.2 Assumptions

I assume that each processing element (PE) has a uniform size of 1mm x 1mm and that the PEs are placed as a square matrix regardless of the topology, as shown in Fig. 2.1. A PE could be a single processor or a sub-system such as a multimedia accelerator, a memory system or an external interface. Each PE can behave as a master (initiator) or a slave (target) depending on its operation. The number of PEs, N, scales from 16 to 100. A hierarchical topology is assumed to be divided into $\sqrt{N}$ clusters, each containing $\sqrt{N}$ PEs.

The bus and point-to-point topologies do not have internal data buffers in their interconnection networks but could have them in their interfaces. Meanwhile, the star and mesh topologies have internal packet buffers at every switching hop. The buffer capacity of each switch is determined by considering the flow control mechanism and the congestion level of the switch. The transaction unit is a packet composed of a 16-bit header and a 64-bit payload (32-bit address and 32-bit data). The packet is serialized onto a unidirectional 10-bit link which consists of 8-bit packet signals, a 1-bit STROBE signal as a timing reference and a 1-bit End-Of-Packet signal [11]. I assume that the clocks of the integrated PEs are plesiochronous, i.e. each PE has its own clock frequency and the clocks are not synchronized with each other.
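As a concrete illustration of the packet format, the sketch below splits one 80-bit packet (16-bit header, 32-bit address, 32-bit data) into ten 8-bit link symbols and marks the last one with End-Of-Packet. The field order and the most-significant-byte-first order are assumptions made for illustration; the text does not specify them, and the STROBE signal is not modeled.

```python
def serialize_packet(header, address, data):
    """Split an 80-bit packet into ten (byte, end_of_packet) link symbols.

    Assumed layout (not taken from the thesis): header, then address, then
    data, sent most significant byte first.
    """
    if not 0 <= header < (1 << 16):
        raise ValueError("header must fit in 16 bits")
    if not 0 <= address < (1 << 32) or not 0 <= data < (1 << 32):
        raise ValueError("address and data must fit in 32 bits")

    word = (header << 64) | (address << 32) | data
    symbols = []
    for index in range(10):
        shift = 8 * (9 - index)
        byte = (word >> shift) & 0xFF
        symbols.append((byte, index == 9))  # End-Of-Packet only on the last byte
    return symbols


symbols = serialize_packet(0xABCD, 0x12345678, 0x9ABCDEF0)
for byte, end_of_packet in symbols:
    print(f"{byte:02X} EOP={int(end_of_packet)}")
```

Ten link transfers per packet follow directly from the 80-bit packet and the 8-bit data portion of the link.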


2.1.3 Traffic Model

Two kinds of traffic patterns are considered: uniform random traffic, and localized traffic with a locality factor, α, whose value is between 0 and 1. The locality factor is the ratio of the intra-cluster traffic to the overall traffic, as illustrated in Fig. 2.2. As α gets close to 1, the traffic becomes highly localized, i.e. most transactions occur within a cluster. If α is 0.5, half of the traffic is intra-cluster and the other half is inter-cluster. PEs that need low-latency, high-bandwidth communication with one another clearly perform better when they are placed in the same cluster according to their communication locality, so that the intra-cluster traffic becomes dominant. In such a heterogeneous system, the locality factor represents the localized traffic pattern quantitatively.
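The locality factor can be made concrete with a small traffic generator. The sketch below is only an illustration: it assumes that N is a perfect square, that PEs are numbered so that each run of sqrt(N) consecutive indices forms one cluster, and that destinations are uniform within the chosen domain. None of these details are prescribed by the text.

```python
import math
import random


def pick_destination(src, n_pe, alpha, rng):
    """Pick a destination PE: intra-cluster with probability alpha, else inter-cluster."""
    cluster_size = math.isqrt(n_pe)  # sqrt(N) PEs per cluster
    if cluster_size < 2 or cluster_size * cluster_size != n_pe:
        raise ValueError("n_pe must be a perfect square of at least 4")
    cluster = src // cluster_size
    first = cluster * cluster_size
    local = [pe for pe in range(first, first + cluster_size) if pe != src]
    remote = [pe for pe in range(n_pe) if pe // cluster_size != cluster]
    return rng.choice(local) if rng.random() < alpha else rng.choice(remote)


rng = random.Random(0)
samples = [pick_destination(5, 16, 0.7, rng) for _ in range(10000)]
local_share = sum(1 for dst in samples if dst // 4 == 5 // 4) / len(samples)
print(f"measured intra-cluster share: {local_share:.3f}")
```

With α = 0.7 the measured intra-cluster share should settle close to 0.7 as the number of samples grows.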

Fig. 2.2: Locality Factor.


2.1.4 Energy, Area and Latency Models for Networks-on-Chip

I use the average packet traversal energy $E_{pkt}$ as the network energy efficiency metric. It can be estimated by the following equation, which sums the energies consumed on the switching hops, on the links and in the final destination buffer [29].

$$E_{pkt} = H_{Avg} \cdot (E_{Queue} + SS_{Avg} \cdot E_{ARB} + SS_{Avg} \cdot E_{SF}) + L_{Avg} \cdot E_{Link} + E_{Queue} \qquad (1)$$

where $H_{Avg}$ and $L_{Avg}$ are the average hop count and the average distance, respectively, between a sender PE and a receiver PE. $SS_{Avg}$ is the average switch size, i.e. the number of I/O ports of a switch. The energy consumed on a switching hop is composed of the energy consumed in the input queuing buffer or latch, $E_{Queue}$, in the switching fabric, $E_{SF}$, and in the arbitration logic, $E_{ARB}$. $E_{Link}$ stands for the transmission energy on a unit-length link. These energy terms are measured from the circuit implementation in 0.18µm technology, as shown in Table 2.1.

The area cost of a network can be derived by summing the areas of the switches and links.

$$A_{Tot} = H_{Tot} \cdot (SS_{Avg} \cdot A_{Queue} + SS_{Avg} \cdot A_{ARB} + SS_{Avg}^{2} \cdot A_{SF}) + L_{Tot} \cdot A_{Link} \qquad (2)$$

where $H_{Tot}$ and $L_{Tot}$ are the total hop count and the total link length of the network, respectively. The physical areas of a queuing buffer, an arbiter and a switch fabric are measured from the real circuit layout in 0.18µm technology (see Table 2.1).


The latency through the network can be derived by accumulating the hop delays and the link delay.

$$T_{Latency} = H_{Avg} \cdot (T_{Sync} + T_{Queue} + T_{ARB} + T_{SF}) + L_{Avg} \cdot T_{Link} \qquad (3)$$

where the hop delay is the sum of the signal synchronization delay ($T_{Sync}$), which occurs when a packet traverses from switch to switch, the queuing delay ($T_{Queue}$), the arbitration delay ($T_{ARB}$) and the switching fabric delay ($T_{SF}$). The design- and technology-dependent parameters are measured from post-layout simulation, as shown in Table 2.1.
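The energy and latency models can be evaluated directly once the physical parameters are fixed. The sketch below is a minimal illustration using the typical values of Table 2.1; the function names and the example operating point are hypothetical, and $T_{Sync}$ and $T_{Queue}$ are left as caller-supplied arguments because Table 2.1 does not list them.

```python
import math

# Typical 0.18 um values from Table 2.1 (energy in J per packet, latency in ns).
E_QUEUE = 1.97e-10   # buffer write/read
E_SF = 6.25e-12      # switching fabric, per port
E_ARB = 1.79e-13     # arbitration, per port
E_LINK = 4.38e-11    # 1-mm link
T_LINK = 0.42        # 1-mm repeated link


def packet_energy(h_avg, ss_avg, l_avg_mm):
    """Eq. (1): per-hop switch energy, plus link energy, plus the destination buffer."""
    per_hop = E_QUEUE + ss_avg * E_ARB + ss_avg * E_SF
    return h_avg * per_hop + l_avg_mm * E_LINK + E_QUEUE


def packet_latency(h_avg, ports, l_avg_mm, t_sync_ns, t_queue_ns):
    """Eq. (3), with T_ARB and T_SF taken from the port-count formulas of Table 2.1."""
    t_arb = math.log2(ports) / 3.0
    t_sf = 6.5e-3 * ports ** 2
    return h_avg * (t_sync_ns + t_queue_ns + t_arb + t_sf) + l_avg_mm * T_LINK


# Hypothetical operating point: one hop through a 16-port switch and 2 mm of wire.
energy_j = packet_energy(h_avg=1, ss_avg=16, l_avg_mm=2.0)
latency_ns = packet_latency(h_avg=1, ports=16, l_avg_mm=2.0,
                            t_sync_ns=0.0, t_queue_ns=0.0)
print(f"{energy_j:.3e} J per packet, {latency_ns:.2f} ns")
```

Setting the synchronization and queuing delays to zero in the example isolates the arbitration, fabric and wire contributions; real values would have to come from the implementation.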

TABLE 2.1: PHYSICAL PARAMETERS IN 0.18µm CMOS TECHNOLOGY [11].

Category              | Description                 | Typical value                        | Sym.
Energy (J) / 1-packet | Buffer (write/read)         | 1.97 x 10^-10                        | E_Queue
Energy (J) / 1-packet | Switching fabric/port       | 6.25 x 10^-12                        | E_SF
Energy (J) / 1-packet | 2:1 multiplexer             | 3.04 x 10^-12                        | E_MUX
Energy (J) / 1-packet | Arbitration/port            | 1.79 x 10^-13                        | E_ARB
Energy (J) / 1-packet | 1-mm link                   | 4.38 x 10^-11                        | E_Link
Energy (J) / 1-packet | 1-mm link (P-to-P)¹         | 8.76 x 10^-11                        | E_Link_PtP
Dimension (mm2)       | Processing Element (PE)     | 1 x 1 mm2                            | (L_PE)^2
Dimension (mm2)       | Cluster Unit                | N^(1/4) x N^(1/4) mm2                | (L_CU)^2
Area (µm2)            | 3-packet queuing buffer     | 8.40 x 10^4                          | A_Queue
Area (µm2)            | Crossbar fabric             | 1.47 x 10^3 x (# of s/w ports)^2     | A_SF
Area (µm2)            | M:1 10b-multiplexer         | 9.52 x 10^2 x (M-1)                  | A_MUX
Area (µm2)            | Arbitration logic           | 2.70 x 10^3 x (# of s/w ports)       | A_ARB
Area (µm2)            | 20b 1-mm link               | 3.80 x 10^4                          | A_Link
Latency (ns)          | Arbiter                     | 1/3 x log2(# of s/w ports)           | T_ARB
Latency (ns)          | Switching fabric            | 6.5 x 10^-3 x (# of s/w ports)^2     | T_SF
Latency (ns)          | 1-mm link (repeated link)   | 0.42                                 | T_Link

¹ The point-to-point topology consumes much more metal routing resources than the other topologies do. Therefore the upper metal layers must be fully used. This increases the vertical coupling capacitance between wire metal layers, and thus the link energy consumption also increases.


2.2 Energy Exploration

2.2.1 Bus Topology

A conventional Mux-based bus structure [33] has two unidirectional buses: one from masters to slaves and the other from slaves to masters, as shown in Fig. 2.3(a). In this master/slave bus topology, two masters cannot be interconnected directly, so the bus cannot support direct message passing between them; instead, they share a memory to communicate with each other. To overcome this limited connectivity, a fully connected bus can be used, as illustrated in Fig. 2.3(b). It has a single shared bus which connects all of the PEs regardless of whether they are masters or slaves. Fig. 2.3(c) shows a hierarchical bus structure in which the local buses are master/slave buses and the global bus is a fully connected bus. Using the fully connected bus in the global network instead of the master/slave bus makes direct access between two clusters possible without a shared memory.

Fig. 2.3: Bus topologies.


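The following equations show the $E_{pkt}$ of each bus in Fig. 2.3.

(a) Master/slave bus (number of masters: N/2, number of slaves: N/2)

$$E^{MSB}_{pkt} = \frac{N}{2} \cdot E_{ARB} + \left\{\frac{N}{2} - 1\right\} \cdot E_{MUX} + \left[\left\{\frac{1}{2} \cdot \frac{N}{2} + \frac{N}{2}\right\} \cdot L_{PE}\right] \cdot E_{Link} + E_{Queue}$$

The multiplexer and arbiter on the master-to-slave (M→S) bus have N/2 inputs. The average distance from a master to the multiplexer can be derived as $\frac{1}{2} \cdot \frac{N}{2} \cdot L_{PE}$, and the length of the shared bus is $\frac{N}{2} \cdot L_{PE}$.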

(b) Fully connected bus

$$E^{FB}_{pkt} = N \cdot E_{ARB} + (N - 1) \cdot E_{MUX} + \left[\frac{3(N - 1)}{2} \cdot L_{PE}\right] \cdot E_{Link} + E_{Queue}$$

The multiplexer and arbiter on the bus have N inputs. The average distance from a master to the last multiplexer is $\frac{(N - 1)}{2} \cdot L_{PE}$, and the length of the shared bus is $(N - 1) \cdot L_{PE}$.

(c) Hierarchical bus

$$E^{HB}_{pkt} = \alpha \cdot E^{MSB}_{pkt\_Local} + (2 \cdot E^{MSB}_{pkt\_Local} + E^{FB}_{pkt\_Global}) \times (1 - \alpha) + E_{Queue}$$

$E^{HB}_{pkt}$ is derived by summing the local and global traversal energies, weighted according to the traffic locality factor. The local and global traversal energies can be obtained from equations (a) and (b), respectively, by replacing N with $\sqrt{N}$.
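The three bus energies can be compared numerically. The sketch below is a rough illustration rather than the author's exact model: it takes the wire lengths stated in the text (master/slave: half of N/2 plus N/2 PE pitches; fully connected: 3(N-1)/2 PE pitches), treats the master/slave distances as linear in N/2 by analogy with the fully connected bus, and keeps the 1-mm PE pitch for the global bus because the text only mentions replacing N with sqrt(N). The function names are hypothetical.

```python
import math

# Typical 0.18 um energies from Table 2.1, in J per packet.
E_QUEUE = 1.97e-10
E_MUX = 3.04e-12     # 2:1 multiplexer
E_ARB = 1.79e-13     # arbitration, per port
E_LINK = 4.38e-11    # 1-mm link
L_PE = 1.0           # PE pitch in mm


def master_slave_bus_energy(n):
    """Bus (a): arbiter and multiplexer with N/2 inputs (assumed linear wire lengths)."""
    masters = n / 2
    wire_mm = (0.5 * masters + masters) * L_PE
    return masters * E_ARB + (masters - 1) * E_MUX + wire_mm * E_LINK + E_QUEUE


def fully_connected_bus_energy(n):
    """Bus (b): arbiter and multiplexer with N inputs, 3(N-1)/2 PE pitches of wire."""
    wire_mm = 1.5 * (n - 1) * L_PE
    return n * E_ARB + (n - 1) * E_MUX + wire_mm * E_LINK + E_QUEUE


def hierarchical_bus_energy(n, alpha):
    """Bus (c): local master/slave buses and a global fully connected bus of size sqrt(N)."""
    cluster = math.sqrt(n)
    local = master_slave_bus_energy(cluster)
    global_ = fully_connected_bus_energy(cluster)
    return alpha * local + (1 - alpha) * (2 * local + global_) + E_QUEUE


for alpha in (0.0, 0.5, 0.8):
    print(alpha, f"{hierarchical_bus_energy(36, alpha):.3e}")
print("flat master/slave:", f"{master_slave_bus_energy(36):.3e}")
```

Under these assumptions the hierarchical bus costs more than the flat master/slave bus when no traffic is local and less once most traffic stays inside a cluster.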

[Figure: $E_{pkt}$ (x 10^-9 J) versus the number of PEs (20 to 100) for the fully connected bus, the hierarchical bus and the master/slave bus; panel label: uniform traffic.]

Fig. 2.4: Energy comparison of three bus topologies.

Fig. 2.4 shows the $E_{pkt}$ of the three buses versus the number of PEs under various traffic patterns. Under uniform traffic, the flat master/slave bus outperforms the hierarchical bus. However, as the traffic becomes localized, which is the realistic case, the energy consumption of the hierarchical bus is significantly reduced. As expected, under localized traffic the hierarchical topology has better energy efficiency than the flat bus topologies.

2.2.2 Mesh Topology

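The following equation shows the $E_{pkt}$ of a 2-D flat mesh.

$$E^{FM}_{pkt} = \frac{2}{3}\left(\alpha \cdot \sqrt[4]{N} + (1 - \alpha) \cdot \sqrt{N}\right) \times \left[E_{Queue} + (E_{ARB} + E_{SF}) \cdot SS_{Avg} + E_{Link} \cdot L_{PE}\right] + E_{Queue} \qquad (4)$$

In the equation, $\frac{2}{3} \cdot \sqrt[4]{N}$ and $\frac{2}{3} \cdot \sqrt{N}$ are the average hop counts of local and global transactions, respectively.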
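The mesh model can be sketched by weighting the two average hop counts with the locality factor and charging one switch traversal and one PE pitch of link per hop. This is an illustrative reading of the equation above; the default switch size of five ports (four neighbours plus the local PE) is an assumption, as are the function names.

```python
import math

# Typical 0.18 um energies from Table 2.1, in J per packet.
E_QUEUE = 1.97e-10
E_SF = 6.25e-12      # switching fabric, per port
E_ARB = 1.79e-13     # arbitration, per port
E_LINK = 4.38e-11    # 1-mm link
L_PE = 1.0           # PE pitch in mm


def mesh_average_hops(n, alpha):
    """Average hop count: (2/3)*N^(1/4) for local and (2/3)*sqrt(N) for global traffic."""
    return (2.0 / 3.0) * (alpha * n ** 0.25 + (1.0 - alpha) * math.sqrt(n))


def mesh_packet_energy(n, alpha, ss_avg=5):
    """Flat-mesh E_pkt; ss_avg=5 is an assumed switch size, not a value from the text."""
    per_hop = E_QUEUE + (E_ARB + E_SF) * ss_avg + E_LINK * L_PE
    return mesh_average_hops(n, alpha) * per_hop + E_QUEUE


for alpha in (0.0, 0.5, 0.9):
    print(alpha, f"{mesh_packet_energy(16, alpha):.3e}")
```

Because every hop pays the queue, arbiter, fabric and link energies, the total falls as α rises and the average hop count shrinks.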

Fig. 2.5: Energy consumption of a mesh.

Fig. 2.5(a) shows $E^{FM}_{pkt}$ compared with the bus topologies. Under uniform traffic, the mesh topology shows better energy efficiency than the flat bus when N is larger than 36. As the traffic becomes localized, the energy consumption of the mesh decreases, but it is still higher than that of the hierarchical-bus topology. The mesh topology shows a flatter slope than the bus topologies as the network size increases. The mesh topology is known to be more scalable than the bus topology; this work quantifies that trend in terms of energy consumption.

It is also interesting to compare the energy consumed on hops (switches) with that consumed on links, as shown in Fig. 2.5(b). Under this implementation condition, the energy consumed on hops is about 5 to 8 times higher than the energy consumed on links.


2.2.3 Star Topology

In a star topology, the hop count is always 1 and every transaction goes through the central crossbar switch. The following equation represents the $E_{pkt}$ of a flat star topology.

$$E^{FS}_{pkt} = 1 \times (E_{Queue} + N \cdot E_{ARB} + N \cdot E_{SF}) + (\sqrt{N} - 2) \times L_{PE} \times E_{Link} + E_{Queue} \qquad (5)$$

The central switch has N I/O ports, and the average distance between two PEs via the central switch is $(\sqrt{N} - 2) \cdot L_{PE}$. The energy of a hierarchical star is given by the following equation.

$$E^{HS}_{pkt} = \alpha \cdot E^{S}_{pkt\_Local} + (E^{S}_{pkt\_Local} + E^{S}_{pkt\_Global}) \cdot (1 - \alpha) \qquad (6)$$

The local and global energies, $E^{S}_{pkt\_Local}$ and $E^{S}_{pkt\_Global}$, can be obtained from (5) by replacing N with $\sqrt{N}$. In the case of the global network, $L_{PE}$ should also be replaced by the cluster-unit dimension $L_{CU}$ (see Table 2.1).
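A short numerical sketch of the flat and hierarchical star energies follows. It assumes the forms used here: one hop through an N-port central switch, an average wire length of (sqrt(N) - 2) unit pitches, and a global network whose pitch is the cluster-unit side N^(1/4) mm from Table 2.1. The inter-cluster term is taken as one local plus one global traversal, which is an interpretation rather than a confirmed detail, and the function names are hypothetical.

```python
import math

# Typical 0.18 um energies from Table 2.1, in J per packet.
E_QUEUE = 1.97e-10
E_SF = 6.25e-12      # switching fabric, per port
E_ARB = 1.79e-13     # arbitration, per port
E_LINK = 4.38e-11    # 1-mm link


def star_energy(ports, unit_len_mm=1.0):
    """Single-hop star with `ports` I/O ports and (sqrt(ports) - 2) unit pitches of wire."""
    switch = E_QUEUE + ports * E_ARB + ports * E_SF
    wire_mm = (math.sqrt(ports) - 2.0) * unit_len_mm
    return 1 * switch + wire_mm * E_LINK + E_QUEUE


def hierarchical_star_energy(n, alpha):
    """Local and global stars of sqrt(N) ports; the global pitch is the cluster side."""
    cluster = math.sqrt(n)
    local = star_energy(cluster)
    global_ = star_energy(cluster, unit_len_mm=n ** 0.25)
    return alpha * local + (1.0 - alpha) * (local + global_)


print("flat star, N=36:", f"{star_energy(36):.3e}")
for alpha in (0.0, 0.5, 0.9):
    print(alpha, f"{hierarchical_star_energy(36, alpha):.3e}")
```

The wire-length term is only meaningful when sqrt(ports) is at least 2, which holds for the 16 to 100 PE range studied here.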

Fig. 2.6: Energy comparison of star topologies.


Fig. 2.6(a) compares $E^{FS}_{pkt}$ and $E^{HS}_{pkt}$. When the traffic is less localized and N is less than a few tens, the flat-star topology shows higher energy efficiency because it has a lower hop count than the hierarchical-star topology. However, as the traffic becomes localized (α > 0.7), the hierarchical star outperforms the flat-star topology. Moreover, the hierarchical star shows a very flat energy profile as the network size increases, so it is highly scalable in terms of energy cost. Fig. 2.6(b) compares the energy consumed in switches and links.

2.2.4 Point-to-Point Topology

The point-to-point topology has a dedicated link between each pair of PEs, as shown in Fig. 2.1(d). It provides the shortest link length without any intermediate switches between a sender and a receiver, and thus it shows the lowest energy consumption among all the topologies. However, it suffers from a huge link area and a large number of input/output ports. (The area cost is discussed in Section 2.3.) The following equation describes the $E_{pkt}$ of the flat point-to-point topology.

$$E^{FP}_{pkt} = (N - 1) \cdot E_{ARB} + (N - 2) \cdot E_{MUX} \times 2 + \frac{2}{3}\left\{\alpha \cdot \sqrt[4]{N} + (1 - \alpha) \cdot \sqrt{N}\right\} \cdot L_{PE} \cdot E_{Link\_PtP} + E_{Queue} \qquad (7)$$

Each PE has a 1:(N-1) de-multiplexer and an (N-1):1 multiplexer at its output and input ports, respectively. The average link length of the point-to-point topology is the same as that of the mesh topology, which is the shortest Manhattan distance.


2.2.5 Heterogeneous Topologies

In the previous sections, I analyzed basic topologies such as the bus, mesh, star and point-to-point topologies, and some hierarchical and homogeneous topologies such as the hierarchical bus and the hierarchical star. The hierarchical topologies were found to provide better energy efficiency and scalability than the flat topologies. Therefore it is worthwhile to examine the energy efficiency of other hierarchical and heterogeneous topologies, for example a local-star global-mesh or a local-bus global-star topology. To consider such hierarchical and heterogeneous topologies, I evaluate the basic topologies in two hierarchical domains: a local (intra-cluster) domain and a global (inter-cluster) domain.

Fig. 2.7: Energy cost in a (a) local and (b) global network.

Fig. 2.7 compares the energy efficiency in a local and a global network separately. In a local network, the energy cost of a link is much lower than that of a switch because of the shorter distance between communicating nodes. Since a mesh topology has a larger number of hops than the others, it shows the highest energy cost. In a global network, on the other hand, the energy cost of a link becomes significant. As a result, the bus topology, which has the longest wires, has the worst energy cost. Considering both the local and global networks, the point-to-point topology is the most energy efficient, followed by the star topology.

All of the topologies are compared under various traffic conditions in Fig. 2.8. Under any traffic condition, the point-to-point topologies show the best energy efficiency. If the point-to-point topologies cannot be adopted because they are infeasible, the star topologies are the most energy efficient among the remaining ones. If N is fixed to 36 (see Fig. 2.8(c)), for instance, the flat star is the best for less localized traffic, while the hierarchical star (L-star G-star) is the best for more localized traffic (α > 0.5). The mesh consumes 30~80% more energy than the hierarchical star does. The mesh outperforms the hierarchical bus by about 10~20% under less localized traffic, but the hierarchical bus shows much better energy efficiency than the mesh under highly localized traffic.


Fig. 2.8: The energy cost comparison under (a) uniform, (b) localized and (c) varying traffic condition.


2.3 Area Exploration

I also analyze the area cost of on-chip networks, which is one of the most important practical issues but has not been considered in prior works.

Fig. 2.9(a) and (b) show the area cost of the basic topologies in the local and global network domains. As expected, the area cost of the point-to-point topology skyrockets, so it is not feasible to implement on a chip. The bus topology shows the lowest area cost. Interestingly, the star topology occupies almost the same area as the bus in a local network and slightly less area than the mesh in a global network. Fig. 2.9(c) compares the area of all of the topologies. The hierarchical bus topology shows the lowest area cost, as expected. However, the local-bus global-star/mesh and local-star global-star/mesh topologies also occupy as little area as the hierarchical bus does. This is because the total network area depends strongly on the local networks rather than on the global network.

[Figure: (a) local area and (b) global area in mm2 (x-axis ticks 4 to 10) for the bus, star, mesh and point-to-point topologies; (c) total network area versus N (20 to 100) for L-PtP G-PtP, L-PtP G-Star, L-Mesh G-Star, flat star, flat mesh, L-Star G-Mesh, L-Star G-Star, L-Bus G-Mesh, L-Bus G-Star and L-Bus G-Bus.]

Fig. 2.9: Area cost in a (a) local, (b) global network and (c) hierarchical topologies.


2.4 Performance Exploration

The communication throughput, the latency and the maximum achievable frequency depend strongly on the topology of the network. In this section, I compare the performance of the hierarchical topologies and the basic topologies.

2.4.1 Average Throughput of a PE

The throughput of a packet-switched (i.e. pipelined) network degrades primarily because different flows² share an intermediate link, as illustrated in Fig. 2.10. Furthermore, if the switch uses FIFO input queues, the head-of-line (HOL) blocking phenomenon further limits the maximum achievable throughput on a link [34]. In this subsection, the average injection throughput of a source PE is analyzed by accounting for these two sources of throughput degradation in each topology.

Fig. 2.10: Throughput degradation due to the link sharing and HOL-blocking.

2 A flow means a unique routing path from a source node to a destination node.


A. Point-to-Point Topology

Since the point-to-point topology has a dedicated link for each flow, there are neither shared links nor shared queues. Thus there is no throughput degradation on the point-to-point network.

B. Bus Topology

Since all the PEs share a single link (bus), the bus bandwidth is divided equally among the PEs. Thus the average injection-throughput of a PE is reduced to 1/(the number of PEs).

C. Star Topology

In the star topology, there is no shared link. However, HOL blocking occurs when a PE is able to issue multiple outstanding addresses³. The throughput degradation due to HOL blocking under uniform i.i.d. traffic has been derived in the literature [34]. The throughput decreases as the number of PEs increases and saturates at 0.58 when the number of PEs becomes larger than sixteen.
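The saturation behaviour of HOL blocking can be reproduced with a small Monte Carlo experiment. The sketch below is not the analytical derivation cited in the text; it assumes saturated FIFO input queues, uniformly random destinations and random arbitration among contending inputs. For very large switches this classical model is known to approach 2 - sqrt(2), about 0.586, which is in line with the 0.58 quoted above.

```python
import random


def hol_saturation_throughput(n_ports, slots=20000, seed=1):
    """Estimate per-port throughput of a FIFO input-queued switch under saturation."""
    rng = random.Random(seed)
    # Destination output of the head-of-line packet at each input.
    heads = [rng.randrange(n_ports) for _ in range(n_ports)]
    delivered = 0
    for _ in range(slots):
        contenders = {}
        for inp, out in enumerate(heads):
            contenders.setdefault(out, []).append(inp)
        for inputs in contenders.values():
            winner = rng.choice(inputs)            # one packet per output per slot
            heads[winner] = rng.randrange(n_ports)  # next packet moves to the head
            delivered += 1
    return delivered / (slots * n_ports)


for ports in (2, 4, 8, 16, 32):
    print(ports, round(hol_saturation_throughput(ports), 3))
```

Inputs that lose arbitration keep the same head-of-line destination in the next slot, which is exactly what blocks the packets queued behind them.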

D. Mesh Topology

The throughput of the mesh topology is more complicated to derive because there are multiple shared links and the throughput depends strongly on the routing algorithm. As a simple approach, I first derived the maximum number of flows on an intermediate link ($N_{FOIL}$) and on a PE-link ($N_{FOPL}$), as illustrated in Fig. 2.11. I applied a dimension-order routing algorithm [36].

3 The ability to issue multiple outstanding addresses means that a PE can issue new transaction addresses without waiting for earlier transactions to complete. Because it enables parallel processing of transactions, this feature can improve system performance [13].

[Figure: flows on the links of a 3 x 3 mesh; link labels $\frac{1}{6} n^2 (n + 1)$ and $n^2 - 1$.]

Fig. 2.11: The number of flows on a link in 3 x 3 mesh topology.

Since NFOIL flows share an intermediate link and NFOPL flows share a PE-link, the aggregated throughput of the flows cannot exceed the link throughput. If the throughput of a flow is denoted as ρflow and the maximum throughput of a link limited by HOL blocking is denoted as ρlink_HOL, the following inequality must be satisfied.

\[ \max(N_{FOIL}, N_{FOPL}) \times \rho_{flow} \le \rho_{link\_HOL} \tag{8} \]

Therefore the average injection-throughput of a PE, ρPE, can be derived as follows.

\[ \rho_{PE} = (n^2 - 1) \times \rho_{flow} \le \min\!\left( \frac{6(n-1)}{n^2},\ 1 \right) \times \rho_{link\_HOL} \tag{9} \]

where ρlink_HOL is a function of the switch size, i.e. the number of input ports of a switch [34].
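As an illustration of equations (8) and (9), the bound can be evaluated numerically. This is a minimal sketch that assumes the flow counts labeled in Fig. 2.11 for an n x n mesh, namely n^2(n+1)/6 flows on the busiest intermediate link and n^2 - 1 flows on a PE-link. The function name is hypothetical, and `rho_link_hol` has to be supplied by the caller because it depends on the switch size.

```python
def mesh_pe_throughput(n, rho_link_hol):
    """Upper bound on PE injection throughput in a flat n x n mesh, per (8)-(9)."""
    n_foil = n * n * (n + 1) / 6   # max flows on an intermediate link (Fig. 2.11)
    n_fopl = n * n - 1             # flows on a PE-link
    # Equation (8): the busiest link caps the throughput of each flow.
    rho_flow = rho_link_hol / max(n_foil, n_fopl)
    # Equation (9): a PE injects n^2 - 1 flows, one to every other PE.
    return n_fopl * rho_flow
```

For n = 3 the PE-link is the bottleneck and the bound equals `rho_link_hol`; for n = 10 the intermediate link dominates and the bound drops to 0.54 times `rho_link_hol`, i.e. 6(n-1)/n^2.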

Fig. 2.12 compares the PE-injection-throughput versus the number of PEs in the basic topologies under uniform i.i.d. traffic. The star and mesh topologies show competitive throughput, while the point-to-point and bus topologies perform the best and the worst, respectively. When the number of PEs is less than 20, the mesh shows better throughput than the star because the smaller switches in the mesh have a lower HOL-blocking probability. However, as the network size gets larger, the intensive sharing of intermediate links degrades the throughput of the mesh.

In on-chip communications, the average injection-throughput of a PE is generally no higher than 0.5 because the PE spends internal processing cycles between load/store instructions4. By this estimate, the star guarantees sufficient bandwidth at any network size. Although the mesh topology shows throughput degradation as the network size grows beyond a few tens of PEs, the throughput still seems acceptable for general systems.

4 The injection-throughput of 0.5 means that the PE executes a load/store instruction every second clock cycle.

Fig. 2.12: Injection-throughput of a PE in basic topologies. (x-axis: number of PEs, 0 to 100; curve annotations: limited by destination link, limited by intermediate link.)

E. Hierarchical Topologies

The hierarchical topology consists of a local network and a global network, as shown in Fig. 2.13.

Fig. 2.13: Throughput in a hierarchical topology. (Figure labels: each PE injects $\rho_{PE}$; the gateway link carries $(1-\alpha)\cdot\rho_{PE}\times n \le \rho_{link\_HOL}$.)

E-1. Bus as a local network

The local bus network should deal with the aggregated throughput from each PE and

the gateway. Therefore the following equation should be satisfied.

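\[ \rho_{PE} \times n + (1-\alpha)\cdot\rho_{PE} \times n \le \rho_{BUS} = 1 \quad \therefore\ \rho_{PE} \le \frac{1}{(2-\alpha)\, n} \tag{10} \]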

E-2. Star as a local network

The throughput on a gateway-link and a PE-link in a local star network is limited by

the HOL blocking.

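\[ (1-\alpha)\cdot\rho_{PE} \times n \le \rho_{link\_HOL} \quad \therefore\ \rho_{PE} \le \left( \frac{1}{(1-\alpha)\, n} \right) \times \rho_{link\_HOL} \tag{11} \]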

E-3. Mesh as a local network

The inter-cluster flows (NFOIL_GW) and intra-cluster flows (NFOIL_LOCAL) share an

intermediate link as illustrated in Fig. 2.13. The maximum NFOIL_GW and NFOIL_LOCAL are

derived as:

\[ N_{FOIL\_GW} = \left\lceil \frac{\sqrt{n}}{2} \right\rceil \cdot \sqrt{n}, \qquad N_{FOIL\_LOCAL} = \frac{1}{6}\, n\, (\sqrt{n}+1) \tag{12} \]

Therefore the throughput on the intermediate link is calculated as the sum of the intra-cluster flows and inter-cluster flows, as in the following equation. The link throughput is limited by HOL blocking.

\[ (N_{FOIL\_LOCAL} - N_{FOIL\_GW}) \times \rho_{flow} \times \alpha + N_{FOIL\_GW} \times \rho_{PE} \times (1-\alpha) \le \rho_{link\_HOL} \tag{13} \]

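\[ \therefore\ \rho_{PE} = (n-1) \times \rho_{flow} \le \frac{(n-1)}{N_{FOIL\_LOCAL} \cdot \alpha + N_{FOIL\_GW} \times \left( (1-\alpha)\cdot n - 1 \right)} \times \rho_{link\_HOL} \tag{14} \]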

E-4. Bus as a global network

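The aggregated throughput of inter-cluster flows on a global bus is less than or equal to the bus throughput.

\[ (1-\alpha)\cdot\rho_{PE} \times n^2 \le 1 \quad \therefore\ \rho_{PE} \le \frac{1}{(1-\alpha)\cdot n^2} \tag{15} \]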

Fig. 2.14: Throughput in a global bus topology. (Figure labels: each of the n clusters sends $(1-\alpha)\cdot\rho_{PE}\times n$ through its gateway; the global bus carries $(1-\alpha)\cdot\rho_{PE}\times n^2$.)

E-5. Star as a global network

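The analysis is analogous to that of the local star network.

\[ (1-\alpha)\cdot\rho_{PE} \times n \le \rho_{link\_HOL} \quad \therefore\ \rho_{PE} \le \left( \frac{1}{(1-\alpha)\, n} \right) \times \rho_{link\_HOL} \tag{16} \]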

In the global star topology, there is only a single link between each local switch and the global switch, so this link limits the overall throughput. To cope with this, the inter-cluster link can be doubled like a fat-tree, as shown in Fig. 2.15. The throughput is then given by the following equation.

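\[ (1-\alpha)\cdot\rho_{PE} \times n \le 2\,\rho'_{link\_HOL} \quad \therefore\ \rho_{PE} \le \left( \frac{2}{(1-\alpha)\, n} \right) \times \rho'_{link\_HOL} \tag{17} \]

where the link throughput $\rho'_{link\_HOL}$ is lower than the $\rho_{link\_HOL}$ of equation (16) because the size of the global switch doubles.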

Fig. 2.15: hierarchical star topology with double link between local and global networks.
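The effect of doubling the inter-cluster link can be sketched numerically. The function below is an illustration under stated assumptions, not the thesis model itself: it assumes that each cluster of n PEs offers (1 - α)·ρPE·n of inter-cluster traffic to its gateway link, as labeled in Fig. 2.13, that each link is capped by the HOL-limited throughput, and that a PE-link is capped by the same limit. The name `global_star_pe_throughput` and its parameters are hypothetical. For the doubled-link case the caller should pass the lower link throughput of the larger global switch.

```python
def global_star_pe_throughput(n, alpha, rho_link_hol, links_per_cluster=1):
    """Bound on PE injection throughput set by the cluster-to-global-switch links."""
    # Inter-cluster load on the gateway per unit of PE injection throughput.
    inter_cluster_share = (1.0 - alpha) * n
    if inter_cluster_share == 0:
        return rho_link_hol      # fully localized traffic: only the PE-link limit applies
    bound = links_per_cluster * rho_link_hol / inter_cluster_share
    # Assumption: a PE-link cannot carry more than rho_link_hol either.
    return min(bound, rho_link_hol)
```

With α = 0.8 and ten clusters, a single link yields half of `rho_link_hol`, while two links restore the full value (before accounting for the lower HOL limit of the larger switch).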

E-6. Mesh as a global network

The same analysis as in (8) applies, with ρflow replaced by ρflow(1-α)n.

\[ \max(N_{FOIL\_GLOBAL}, N_{FOPL\_GLOBAL}) \times \rho_{flow} \cdot (1-\alpha) \cdot n \le \rho_{link\_HOL} \tag{18} \]

\[ \text{where } N_{FOIL\_GLOBAL} = \frac{1}{6}\, n\, (\sqrt{n}+1), \qquad N_{FOPL\_GLOBAL} = n - 1 \tag{19} \]

\[ \text{Thus, } \rho_{PE} = (n-1) \times \rho_{flow} \le \frac{n-1}{(1-\alpha)\cdot n \times \max(N_{FOIL\_GLOBAL}, N_{FOPL\_GLOBAL})} \times \rho_{link\_HOL} \tag{20} \]

Fig. 2.16(a) shows the PE-injection-throughput comparison in a local network where the number of PEs is limited to 10. The local mesh topology provides the best throughput. Meanwhile, the throughput of the star topology decreases as the network size increases because of the limited bandwidth on the gateway-link between the local and global networks. By doubling the gateway-link, the throughput of the local star becomes similar to that of the local mesh. As the traffic gets localized, the local network throughput increases, as shown in Fig. 2.16(b). Under highly localized traffic, the mesh and star provide similar throughput.

Fig. 2.16: PE-injection-throughput in a local network.

Fig. 2.17(a) shows the PE-injection-throughput comparison in a global network where the number of clusters (n) is limited to 10 and the number of PEs in a cluster is n. When the number of clusters is less than or equal to 5, the star and mesh topologies do not suffer from throughput degradation. However, as the network size gets bigger, the throughput of the mesh and star topologies is degraded by excessive link sharing. By doubling the global link, as in star2, the degradation disappears.

Fig. 2.17: Throughput in a global network.

Fig. 2.18: Throughput in hierarchical topologies.5 (a) Throughput versus the number of PEs (n2), 10 to 100, at α = 0.8; (b) throughput versus the locality factor (α), 0.2 to 0.9, at n2 = 36. (Curves: lbgs, lbgm, lbgs2, lsgs, lsgm, lmgs, lmgs2, lsgs2, Flat Mesh.)

Fig. 2.18 shows the overall throughput of the hierarchical and heterogeneous topologies. First, a bus topology in the local or global network severely limits the overall throughput, as expected. Interestingly, the flat mesh topology performs the best in terms of network throughput. Though lmgs2 and lsgs2 show lower throughput under less-localized traffic, they are still acceptable for general systems. By using the proposed analytical method, we can estimate the throughput of hierarchical topologies and understand the trends of each topology.

5 Notation of the items: l (local), g (global), m (mesh), s (star), s2 (star with double global links), b (bus)

2.4.2 Average Packet Traversal Latency

The latency through the network can be derived by accumulating the hop delay and

link delay.

\[ T_{Latency} = H_{Avg} \cdot T_{Hop} + L_{Avg} \cdot T_{Link}, \quad \text{where } T_{Hop} = T_{Sync} + T_{Queue} + T_{ARB} + T_{SF} \tag{21} \]
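Equation (21) is a plain accumulation, which the following sketch makes explicit. The function names are illustrative and the numbers in the usage note are made-up cycle counts, not measurements from this work. The unit of the link term (number of links or link length) is not specified here, so it only needs to match how the per-link delay is defined.

```python
def hop_latency(t_sync, t_queue, t_arb, t_sf):
    """T_Hop: synchronization + queuing + arbitration + switching-fabric delay."""
    return t_sync + t_queue + t_arb + t_sf

def packet_latency(avg_hops, t_hop, avg_links, t_link):
    """T_Latency = H_Avg * T_Hop + L_Avg * T_Link, as in equation (21)."""
    return avg_hops * t_hop + avg_links * t_link
```

For example, with hypothetical per-hop components of 1, 2, 1 and 1 cycles, `hop_latency` returns 5 cycles, and a path with 2 hops and 3 links of 1 cycle each gives `packet_latency(2, 5, 3, 1)`, i.e. 13 cycles.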


Fig. 2.19: Latency in a hop-switch and the maximum achievable clock frequency.

Fig. 2.19 shows the latency in a hop-switch and its maximum achievable clock frequency versus the number of I/O ports in a switch, up to 10. The queue-waiting latency (Tqueue) is caused by output-port conflicts and HOL blocking in a FIFO queue. The clock frequency of a switch is determined by the speed of the arbitration logic. It saturates at around 450MHz even when the switch has 100 I/O ports, if the arbiter is implemented with Mux-Tree circuits [11]. With these switch-latency characteristics, I evaluated the latency of various hierarchical and heterogeneous topologies, as shown in Fig. 2.20.

The average packet traversal latency figures are similar to the average packet traversal energy in Fig. 2.8, since latency, like energy consumption, accumulates over each hop and link. The evaluation results show that the hierarchical star has the lowest latency, followed by lsgm. The flat mesh topology experiences almost twice the latency of the hierarchical topology. The packet latency in a flat star topology rises sharply because of its large switching-fabric latency.

Fig. 2.20: Average packet traversal latency.


2.5 Discussion

2.5.1 Performance (Throughput and Latency)

The best throughput is provided by the flat mesh topology. However, the flat mesh topology suffers from long latency (twice that of the H-star) and a larger area cost (three times that of the H-star). Furthermore, when the traffic gets localized, the hierarchical and heterogeneous topologies provide throughput competitive with the flat mesh topology. In many embedded systems, the required average injection-throughput of a PE is not very high. For example, if a PE is a general-purpose microprocessor with a cache memory, the miss rate is less than 0.01 in conventional benchmark applications [37]. Thus the PE-injection-throughput (or the cache-block replacing traffic) will be less than 0.1. With this throughput requirement, most of the topologies except the bus topologies meet the throughput specification.

On the other hand, the network latency can be a more critical issue in real-time applications. Moreover, latency reduction directly improves system performance. In terms of network latency, the hierarchical star topology performs best because it has the lowest hop count.

2.5.2 Cost (Energy and Area)

I have analyzed the energy and area cost of the flat and hierarchical topologies with various network sizes and traffic patterns. The energy cost of the point-to-point topologies is the lowest, but their area cost is too high for implementation. Although the hierarchical bus shows the lowest area cost, its energy cost increases more rapidly with network size than that of the other hierarchical topologies. According to our analysis, the hierarchical or multilayer bus cannot be used when the number of integrated cores is larger than 25. The flat mesh topology shows better energy efficiency than the hierarchical bus only when the traffic is uniformly distributed. However, as the traffic gets localized, the energy cost of the mesh does not scale down as much as that of the hierarchical topologies. Moreover, the area cost of the mesh is usually three times larger than that of the hierarchical topologies. If the point-to-point topologies are not taken into account, the flat star topology shows the best energy efficiency when the traffic is uniform and the network size is less than 80. However, its energy efficiency becomes worse as the traffic gets localized and the network size gets bigger.

The hierarchical star (local-star global-star) topology is the most cost-efficient and scalable topology for heterogeneous systems in which the traffic is localized. Its energy cost is the lowest among the hierarchical topologies, and its area cost is comparable with that of the hierarchical bus.

Fig. 2.21: Energy2 and area product. (a) Energy2 x Area under uniform traffic; (b) Energy2 x Area at α = 0.80; x-axis: N, 20 to 100. (Curves: L-Bus G-Mesh, Multilayer Bus, Flat Mesh, Flat Star, Flat PtP, L-Star G-Star, L-Star G-Mesh, L-Bus G-Star, L-Mesh G-Star, L-PtP G-PtP.)

Fig. 2.21 shows the energy2 x area product as a single cost-efficiency metric, normalized to the hierarchical bus. Under both uniform and localized traffic, the hierarchical star shows the best cost-efficiency. As a summary, Fig. 2.22 shows the cost distribution of selected topologies with traffic and network size variations. The height and the slope of the distributed area represent the energy variation according to the traffic variation and the network size, respectively. The hierarchical star topology shows less energy variation than the hierarchical bus and the mesh. It also shows less area variation than the mesh and the star topology. Finally, the hierarchical star topology is located in the lower-left corner, which means that it is the most cost-effective of the topologies. In order to increase the throughput of the hierarchical star topology, the global link can be doubled as shown in Fig. 2.15. In this case, the area overhead is 20%, which is not significant compared with the throughput enhancement.
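The cost-efficiency metric used in Fig. 2.21 can be written down in a few lines. The sketch below only illustrates the normalization step. The function names are hypothetical, and the energy and area values in the usage note are placeholders, not results from this analysis.

```python
def cost_metric(energy, area):
    """Energy^2 x area product used as a single cost-efficiency figure."""
    return energy ** 2 * area

def normalized_costs(topologies, baseline):
    """Normalize every topology's metric to that of the baseline topology."""
    base = cost_metric(*topologies[baseline])
    return {name: cost_metric(energy, area) / base
            for name, (energy, area) in topologies.items()}
```

For instance, `normalized_costs({"hierarchical bus": (2.0, 1.0), "hierarchical star": (1.0, 1.2)}, "hierarchical bus")` assigns 1.0 to the baseline and 0.3 to the second entry, because squaring the energy weights it more heavily than the area.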

Fig. 2.22: Energy and Area Cost distributions with traffic and the network size variations.


2.6 Model Verification with Case-Study Examples

To validate the analytical performance and cost models proposed in this chapter, I compared the analytical results with layout-aware simulation results for two topologies, as shown in Fig. 2.23 [2], [38]. Both examples contain thirty PEs: ten ARM 7 cores, ten private memories, five traffic generators and five shared memories. In the first example, five AMBA AHB buses are connected to any of the five shared memories through a crossbar switch. This topology can be roughly approximated as local-bus global-star among our analytical topologies. The second example is a 5x3 mesh topology; however, unlike a conventional mesh, two PEs are connected to each switch. On top of the two platforms, a multimedia benchmark is run on the ARM processors with heavy synchronization activity, since it models a producer/consumer pipeline of multimedia processing. The interconnection fabrics are synthesized with 0.13µm technology libraries [38].


Fig. 2.23: Validation Examples and those layout views in 0.13µm technology

I compared the results from the analytical models with the simulated results, as shown in Fig. 2.24. The energy and latency figures are normalized to the multilayer bus. Since the simulation model uses a different packet format, flit size and flow control scheme, the absolute numbers differ. However, this work emphasizes the relative trends between topologies. Even though two examples are not sufficient to establish the performance and cost trends, the plots in the figure show similar trends.

Fig. 2.24: Comparison between the proposed analytical model and the simulation results


2.7 Summary

In this chapter, a throughput/latency/energy/area-oriented topology exploration is

performed for Networks-on-Chip design. The evaluated topology candidates include

not only flat topologies such as a bus, mesh, star and point-to-point but also sixteen

hierarchical and heterogeneous topologies. The evaluation method uses technology-

independent analytical models with implementation-based physical parameters.

This work reveals the performance and cost relationship in a hierarchical way according to the number of integrated cores and various traffic patterns. As a result, the hierarchical bus is no longer a cost-optimal structure when the network interconnects more than 25 nodes. The mesh topology is also not cost-efficient compared with the hierarchical topologies. The hierarchical topologies perform better than the flat topologies as the traffic becomes more realistic, with locality properties. Among them, the hierarchical star topology shows the best cost-efficiency and also the lowest latency. This work provides NoC designers with a fundamental understanding of the performance and energy/area cost trends of not only the conventional flat topologies but also the hierarchical and heterogeneous topologies, through a straightforward and scalable analytical method.

Chapter 3 NoC Architecture

In the design of Networks-on-Chip, many things must be decided, such as the communication protocol, network topology, switching style, clock-synchronization method, signaling scheme and so on. In this chapter, I explain what I decided and why I made each decision at each design stage, with low power consumption as the guiding idea.

3.1 Circuit Switching and Packet Switching

Put simply, the intra-cluster star network uses circuit switching. In circuit switching, a physical path from the source to the destination is reserved prior to the transmission of the data. Inside a cluster, once a transmission is granted, it is neither interrupted by other transmissions nor stored in a buffer; thus a deterministic packet delay is guaranteed, provided that the processing units are synchronized. Such predictable and deterministic service is a crucial requirement for software programming, and a Quality of Service (QoS) guarantee is even more important for real-time applications. Moreover, thanks to circuit switching, area- and power-consuming buffers are not needed in the intra-cluster networks. Circuit construction and destruction do cause communication latency in multi-hop networks. Fortunately, the intra-cluster network has a star topology, i.e. a single hop, so the latency overhead is minimal. When there is no transaction on a reserved circuit, the throughput of the channel can be severely degraded. To prevent this weakness of circuit-switched networks, the reserved circuit is automatically disconnected after each packet transmission, so that other processing units can use the channel without waiting. In order to transmit multiple packets, a burst transaction protocol is provided. During a burst transaction, the channel is not torn down and is not interrupted by other transactions. This protocol issue will be discussed in more detail in Section 3.4.

Inter-cluster traffic has longer end-to-end latency, and the amount of traffic is much less than that of intra-cluster traffic. The peripheral cluster operates much more slowly than the main cluster. If the inter-cluster network used circuit switching, the inter-cluster channel would show very low utilization because of the slow response of the peripheral units. Therefore, packet switching, i.e. store-and-forward switching, is more appropriate for inter-cluster networks. The main advantage of packet switching is that it permits statistical multiplexing on the channel. That is, packets from many different sources can share the channel, allowing very efficient use of the fixed capacity. There are packet buffers at both ends of the inter-cluster channel. The shift-register-type buffer with a capacity of a single packet takes 200 x 140µm2 of area and consumes 6.5mW at 1.6GHz.

3.2 Synchronization

A state-of-the-art or near-future SoC can be seen as a heterogeneous multi-processing system with multiple timing references, because of the difficulty of global synchronization as well as PU-independent power management techniques such as dynamic frequency scaling. Fig. 3.1 shows the proposed synchronization structure in a NoC. Each processing unit operates with its own clock, but communicates over the network with a common clock, CLKNET. The Network Interface (NI) changes the timing reference from CKn to CLKNET and vice versa. It is possible to synchronize CLKNET inside a cluster because the physical area of a cluster can be within a single clock domain. However, it is very difficult to synchronize CLKNET across all of the clusters. Therefore inter-cluster communication becomes mesochronous (same frequency but different skew), and a synchronization scheme is needed for reliable transmission. In this implementation, the packet buffers between switches naturally play the role of FIFO synchronization circuits. I also use a source-synchronous scheme in which a strobe signal travels along with the packet data. The strobe signal is used as the timing reference to latch the packet data at the receiver end.

Fig. 3.1: Synchronization structure in a NoC. (Figure labels: PU clocks CK1, CK2, CK3; network clock CLKNET, which is not synchronized between clusters; intra-cluster circuit switching and inter-cluster packet switching between Cluster A and Cluster B; a STROBE signal accompanies the packet header, address and data.)

3.3 Serialization

On-chip serial communication has advantages over multi-bit parallel communication in many challenging issues such as signal skew, crosstalk, area cost and wiring difficulty [9], [13]. In this implementation, a packet of 80 bits (16-bit header, 32-bit address and 32-bit data) is serialized onto 8 bits by SERDES circuits inside the NI. The serialization method differs from the previous work [9], as shown in Fig. 3.2. In the previous implementation, the header, address and data have their own channels of 4 bits, 8 bits and 8 bits, respectively, operating at 800MHz. In this implementation, however, they are time-multiplexed onto a single 8-bit channel operating at 1.6GHz. Therefore the area of the network is further reduced by 1/4 due to the smaller channel width. In terms of network bandwidth, this serialization style also has an advantage over the previous style. For example, in a read operation, the request packet does not have to contain the data field and the response packet does not need the address field. While the previous serialization method always has a fixed packet length, this implementation has a shorter, variable packet length because unnecessary fields are removed. As a result, the utilization of a shared channel can increase further. In order to indicate the packet length at line speed, a sideband End-of-Packet (EOP) signal is needed.
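The difference between the two serialization styles can be checked with simple arithmetic. The sketch below uses only the field widths, channel widths and clock rates stated above; the function names are illustrative, and control overhead such as the EOP sideband is ignored.

```python
import math

# Field widths stated in the text: 16-bit header, 32-bit address, 32-bit data.
FIELD_BITS = {"header": 16, "address": 32, "data": 32}

def multiplexed_cycles(fields, channel_bits=8):
    """Cycles to send the given fields over one time-shared channel."""
    total_bits = sum(FIELD_BITS[name] for name in fields)
    return math.ceil(total_bits / channel_bits)

def separate_channel_cycles(widths):
    """Cycles when each field has its own channel; the slowest field dominates."""
    return max(math.ceil(FIELD_BITS[name] / bits) for name, bits in widths.items())

def duration_ns(cycles, clock_ghz):
    """Transfer time in nanoseconds for a given cycle count and clock rate."""
    return cycles / clock_ghz

full_packet = multiplexed_cycles(["header", "address", "data"])   # all 80 bits
read_request = multiplexed_cycles(["header", "address"])          # no data field
previous = separate_channel_cycles({"header": 4, "address": 8, "data": 8})
```

Under these assumptions a full packet takes 10 cycles on the 8-bit channel (6.25 ns at 1.6GHz) and a read request takes 6 cycles (3.75 ns), while the previous style always takes 4 cycles at 800MHz (5 ns) on 20 wires. The shorter variable-length packets are what allow the narrower channel to keep up.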

Fig. 3.2: Two serialization methods. (Packet length in time for this work versus the previous work; fields: header (Hd), address (Addr.), data (Data).)

3.4 NoC Protocol

Fig. 3.3 shows the NoC protocol, including the packet format and packet transactions. The NoC protocol supports burst packet operations of length 2, 4, or 8 packets for large data transactions. A burst transaction is not interleaved with packets of other flows. A burst read request packet contains only the base address, but its response contains successive data from the base address, with the address incremented by 4 for each packet, as shown in Fig. 3.3(b). The first packet carries the full header information, while the following packets carry only the minimum information required for routing in their compact header. A burst write request of length 4 sends the base address only in the first packet, as shown in Fig. 3.3(c). The following packets have only a compact header and a data field. By using burst transactions with compact packets, the total transaction time is reduced by half compared with multiple single-packet transactions.
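The saving from compact packets can be estimated by counting bits on the channel. This is a rough sketch under an explicit assumption: the width of the compact header is not given in this section, so it is left as a parameter, and the value used in the usage note is hypothetical. The function names are illustrative, and acknowledge packets are not counted.

```python
def burst_write_bits(burst_length, compact_header_bits,
                     header_bits=16, addr_bits=32, data_bits=32):
    """Bits sent for a burst write: one full packet, then compact packets."""
    first = header_bits + addr_bits + data_bits
    # Following packets carry only a compact header and a data field.
    following = (burst_length - 1) * (compact_header_bits + data_bits)
    return first + following

def single_writes_bits(count, header_bits=16, addr_bits=32, data_bits=32):
    """Bits sent when every write is a separate full packet."""
    return count * (header_bits + addr_bits + data_bits)
```

With a hypothetical 8-bit compact header, `burst_write_bits(4, 8)` is 200 bits against 320 bits for `single_writes_bits(4)`. The exact ratio depends on the compact header width and on how request and response traffic are counted, neither of which is specified here.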

Fig. 3.3: Packet format. (a) Packet format: header information, Address (32b), Data (32b); (b) Burst read transaction with BL=4 (addresses A, A+4, A+8, A+12); (c) Burst write transaction with AC request (BL=4, AC=1), with compact packets and an acknowledge packet between master and slave.
Header fields:
C: Compact packet indicator
E: Bus encoding indicator
Dest. ID: Destination Network ID
Source ID: Source Network ID
Pr: Packet priority
W: Write/Read
Rv: Reserved
A: Address field enable
D: Data field enable
AC: Acknowledge request
BL: Burst length (1, 2, 4, 8)

A switch arbiter peeks at the destination ID in the header field and looks up the output port number in a look-up table. The 1-bit priority information in the header enables differentiated scheduling among packets. The priority can be determined by software, i.e. the Operating System (OS) or application programs. For more reliable transactions, an acknowledgement can be requested, as shown in Fig. 3.3(c). The protocol provides a 1-bit sideband back-pressure signal for congestion control in the network. The back-pressure signal is asserted when a packet buffer exceeds a predetermined threshold or when the destination PU temporarily cannot provide service.
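The three mechanisms in this paragraph (table look-up, priority scheduling and back-pressure) can be summarized in a behavioral sketch. Everything below is illustrative: the dictionary keys `dest_id` and `priority`, the function names, and the tie-breaking rule (oldest packet first among equal priorities) are assumptions, not details of the implemented arbiter.

```python
def select_output_port(header, routing_table):
    """Look up the output port for the destination ID found in the header."""
    return routing_table[header["dest_id"]]

def pick_next(packets):
    """Index of the packet to serve: higher priority first, then oldest first."""
    return max(range(len(packets)),
               key=lambda i: (packets[i]["priority"], -i))

def back_pressure(buffer_occupancy, threshold, destination_ready=True):
    """Assert back-pressure when the buffer exceeds its threshold or the PU is busy."""
    return buffer_occupancy > threshold or not destination_ready
```

For example, among three waiting packets with priorities 0, 1 and 1, `pick_next` returns index 1, the older of the two high-priority packets.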

The NoC protocol, named BONE, was upgraded from previous versions. The completeness and reliability of the protocol functionality were verified with a high-level C-based simulator, BONE-SIM [24].

3.5 Queuing Buffer and Memory Design

Queuing buffers are used at the input ports of a switch and in the network interface. The queuing buffer consumes the most area and power among the building blocks of the on-chip network. The buffer circuits can be designed with two different memory units, either registers (flip-flops) or static-RAM cells. Fig. 3.4 shows four different register designs: (a) a conventional Shift-Register, (b) a Push-In Shift-Out register, (c) a Push-In Bus-Out register and (d) a Push-In Mux-Out register. An SRAM-style design is shown in Fig. 3.5.

Figure 3.4: Register designs for queuing buffer. (a) Conventional Shift Register; (b) Push-In Shift-Out Register; (c) Push-In Bus-Out Register; (d) Push-In Mux-Out Register. (Figure labels: Packet In, Packet Out, new arrival, intermediate empty bubble, controller enable signals, read pointer, first-in packet.)

Figure 3.5: Dual-port SRAM design for queuing buffer.

In a conventional Shift-Register, intermediate empty cells (bubbles) can exist whenever the packet input and output rates temporarily differ. Shifting all the registers at every packet-out consumes a huge amount of power. Furthermore, the minimum latency in the queue is as long as the physical queue length rather than the backlog. Although this design is the simplest, it is not desirable for on-chip implementation because of its longer latency and unnecessary power dissipation.

To remove the intermediate empty bubbles, an arriving packet can be stored in the frontmost empty cell rather than at the tail of the queue. This input style is called ‘Push-In’, as illustrated in Fig. 3.4(b). It removes the unnecessary latency and power consumption caused by the empty bubbles. Only the occupied register cells are enabled. However, the shifting register style still consumes unnecessary power by shifting all of the occupied cells at every packet output. To avoid the shifting operation, the outputs of all registers are tied to a shared output bus line via tri-state buffers, as shown in Fig. 3.4(c). The register holding the first-in packet is connected to the output bus by turning on its tri-state buffer. In this design, only the cell in which a newly arrived packet is stored is enabled. As the queuing capacity increases, the capacitance of the shared bus wire increases as well because of the parasitic capacitance of the tri-state buffers, and the delay and power consumption also increase. To eliminate this effect, output multiplexers can be used, as shown in Fig. 3.4(d).
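The benefit of avoiding shifts can be illustrated with a behavioral model that counts register-enable events. This is a sketch rather than the implemented circuit: it assumes that the Push-In Mux-Out queue places new packets with a circular write position, which this section does not specify, and the class and function names are illustrative.

```python
class PushInMuxOutQueue:
    """Behavioral model: cells never shift; a read pointer selects the output."""

    def __init__(self, capacity):
        self.cells = [None] * capacity
        self.read_ptr = 0        # cell holding the first-in packet (mux select)
        self.count = 0           # number of occupied cells
        self.cell_enables = 0    # register-enable events, a proxy for switching power

    def push(self, packet):
        if self.count == len(self.cells):
            raise OverflowError("queue is full")
        write_ptr = (self.read_ptr + self.count) % len(self.cells)
        self.cells[write_ptr] = packet   # only this one cell is enabled
        self.cell_enables += 1
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue is empty")
        packet = self.cells[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.cells)  # no data movement
        self.count -= 1
        return packet

def shift_out_cell_enables(num_packets):
    """Enables for a Push-In Shift-Out queue filled with num_packets, then drained."""
    pushes = num_packets
    # Each pop shifts every packet still waiting behind the head by one cell.
    shifts = sum(range(num_packets))
    return pushes + shifts
```

Filling and draining four packets costs 4 enables in this Mux-Out model against 10 in the Shift-Out model, and the gap grows quadratically with the backlog.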

These register-based implementations are limited in capacity by area and power constraints. Once the queuing capacity rises to about a dozen packets, the register-based implementation is no longer attractive in terms of either area or power [29]. Therefore, for large-capacity queuing, the queuing buffer should be designed with dual-port SRAM cells, as shown in Fig. 3.5. The figure also shows the cell circuit and its layout. An SRAM cell occupies only a tenth of the area of a register (flip-flop). This area ratio does not vary much with technology.

Chapter 4 Low-Power Techniques

In this implementation, I proposed various low-power techniques in the physical layer, data-link layer, network layer and transport layer. In this chapter, those techniques are presented briefly.

4.1 Low-swing signaling

The global links connecting clusters are usually a few millimeters long. Therefore, they suffer from longer latency and higher power consumption than local links, making cross-chip communication increasingly expensive. Low-swing signaling can reduce the energy consumption significantly, and overdriving improves the delay [14].

4.1.1 Driver circuits

There is a limited degree of freedom in the design of drivers. If a reduced supply voltage, VDDL, is used, the power dissipation is reduced to (VDDL/VDD)^2 of its full-swing value [Fig. 4.1(a) and (b)]. VDDL can be supplied from off-chip or generated on-chip by DC-DC conversion from the main supply, VDD. If no additional power supply is available, the low swing can be obtained by exploiting a transistor Vth drop or pulse-enabled driving [Fig. 4.1(c) and (d)]. However, these designs are susceptible to process variation and noise. In the pulse-controlled driver, Vswing is determined not only by the pulse duration but also by Cw, which is hard to estimate at the design stage. The most widely used driver is type (b) [14], [11], [32]. By using an NMOS pull-up transistor instead of a PMOS transistor, a faster rise time on the output wire is obtained with a smaller transistor size.
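The quadratic saving of a reduced-supply driver is easy to quantify. The helper below simply evaluates the (VDDL/VDD)^2 ratio stated above; the function name is illustrative, and the example voltages in the usage note are chosen for arithmetic convenience, apart from the 1.6V supply used elsewhere in this chapter.

```python
def low_swing_power_ratio(vddl, vdd):
    """Driver power relative to full swing when the wire is driven from VDDL."""
    if not 0 < vddl <= vdd:
        raise ValueError("expected 0 < VDDL <= VDD")
    return (vddl / vdd) ** 2
```

With VDD = 1.6V, driving the wire from 0.8V leaves one quarter of the full-swing driver power, and 0.4V leaves one sixteenth. This ratio does not include the energy the receiver spends restoring the full swing.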

Fig. 4.1: Driver Circuits for Low-Swing Signaling.

4.1.2 Receiver circuits

There are two design options for receivers with regard to noise immunity: a single-ended level converter [Moisiadis00], [40], [41] or a differential amplifier [14], [42], [43], [11]. Differential signaling is more immune to noise because of its high common-mode rejection, allowing a further reduction in the swing voltage [32], [14]. Although differential signaling requires twice as many wires, the wiring congestion can be alleviated by using on-chip serial links [44]. Zhang evaluated many different receivers, including a pseudo-differential amplifier, with respect to energy, delay, swing and SNR.

Fig. 4.2: Clocked differential sense-amplifier.

In the on-chip interconnection network, the receiver circuits should be lightweight and occupy as small an area as possible because they are used abundantly in most of the network interfaces. Fig. 4.2 shows an example of a simple clocked sense-amplifier with a three-stage CMOS inverter chain. PMOS transistors are used as the input gates in order to receive a low common-mode input signal. The sizes of the input gates and their bias currents are chosen to amplify the desired low-swing differential input. (WP/LP = 3µm/0.18µm, Vswing = 0.2V, VDD = 1.6V, Area = 10 x 15µm2 [11])

4.1.3 Static and Dynamic Wires

There are two kinds of wires: static and dynamic. To speed up its response, a dynamic wire is precharged to VDDL through PMOS transistors before each bit transition. After the precharging phase, the wire is conditionally discharged by the pull-down transistor of a driver. This dynamic signaling is used for multidrop buses that have large fan-in and large fan-out. In a network on chip, however, the link is point-to-point, so there is only a single fan-in and fan-out. Therefore a dynamic wire is not a good candidate for on-chip networks, especially when the wire has long latency. Furthermore, it is susceptible to noise.

4.1.4 Implemented Low-swing Link

Fig. 4.3: Low-swing signaling and its transceiver circuits. (Figure labels: transmitter with predriver and NMOS driver (10/0.18); 5.2mm wires in zigzags; receiver with clocked sense amplifier (3/0.18 input transistors), inverter amplifiers and clock restore circuit; signals PACKET_IN, PACKET_OUT, STROBE_IN, STROBE_OUT, STB, nSTB; VSWING < VDD.)

Figure 4.3 shows the implemented low-swing and differential signaling. Global

- 64 -

wires between main cluster and peripheral cluster are laid out in zigzags to emulate a

global link as long as 5.2mm without repeaters. Transmitter drives the wires using

VSWING, less than VDD, and receiver restore the swing to its normal voltage, VDD. The

driver uses N-fet for both pull-up and pull-down instead of an inverter to exploit their

lower linear resistance at small Vds. Because the gate-voltage of the pull-up NMOS

transistor is higher than its drain-voltage, Vth-drop is not observed on the wire. In

addition, since the size of NMOS transistor is smaller than equivalent PMOS transistor

by 2/5, the capacitive load of pre-driver is reduced.

A simple clocked sense amplifier followed by a 3-stage inverter chain performs

full logic amplification of low-swing. PMOS input gates are used in order to receive a

low common-mode input signal. The sizes of the input gates and bias current are

chosen to amplify as low as 200mV swing to 1.6V full-logic swing with small delay

penalty. A clock signal for the clocked sense amplifier is regenerated from a STB

signal by a clock restore circuit (CRC). In the CRC, an inverted input, nSTB, is fed to

the P1-gate in order to reduce the standby current. In standby (STB=0 or nSTB=1),

the gate voltage of the P1 transistor rises as high as VSWING, and thus the bias current

decreases. Due to this scheme, the standby bias current of the CRC becomes almost

[Figure 4.4 plots. x-axis of both panels: driving voltage VSWING (volts), 0.3 to 1.1. Panel (a): transmitter and receiver energy. Panel (b): curves for 400Mbps, 800Mbps and 1.6Gbps, with the optimal point marked.]

Fig. 4.4: (a) Energy consumption, (b) energy and delay product versus voltage

swing at various signal rates.

C. Svensson demonstrated the existence of an optimum voltage swing for the

lowest energy dissipation on long wire signaling [15]. In our implementation, the

VSWING scales to the most energy and performance efficient voltage swing at each

operating frequency of the on-chip network. To find out the optimum voltage swing at

which the energy and delay product is the smallest, I conducted post-layout

simulations with a precise capacitance and resistance wire model. A 5mm metal2 wire

of 0.5um width and 1.1um space has 330fF parasitic and 100fF coupling capacitance.

VSWING is swept from 0.25V to 1.1V in 50mV steps, and the signaling rates are 400Mbps,

800Mbps, and 1.6Gbps. See Fig. 4.4(a). The required energy on the transmitter to

create a certain voltage swing on the wires decreases linearly with decreasing swing,

whereas the energy to amplify this signal back to a normal logical swing increases

superlinearly with decreasing swing. The optimum VSWING exists due to such opposite

trends of transmitter and receiver.
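
The opposing trends can be written as a simple closed-form sketch. The functional forms below are assumptions made only for illustration: a transmitter energy that is linear in swing and a receiver energy that follows an inverse power law. The coefficients a, b and the exponent k are hypothetical fitting parameters, not values taken from the post-layout simulations.

```latex
E_{\mathrm{total}}(V_{\mathrm{SWING}})
  = \underbrace{a\,V_{\mathrm{SWING}}}_{\text{transmitter}}
  + \underbrace{\frac{b}{V_{\mathrm{SWING}}^{\,k}}}_{\text{receiver}},
  \qquad a,\, b,\, k > 0

\frac{dE_{\mathrm{total}}}{dV_{\mathrm{SWING}}}
  = a - \frac{k\,b}{V_{\mathrm{SWING}}^{\,k+1}} = 0
  \;\Longrightarrow\;
  V_{\mathrm{opt}} = \left(\frac{k\,b}{a}\right)^{\frac{1}{k+1}}
```

Under these assumptions a single minimum always exists, and anything that lowers the receiver coefficient b relative to a moves the optimum toward a smaller swing.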

According to the results, less energy is needed at higher signal rates. The reasons

are as follows. At the 1.6Gbps signal rate, the wire does not swing fully up to the driving

voltage, because the sum of rising and falling time is longer than the signal period.

Therefore, the energy needed on the transmitter decreases as the signaling rate

increases. At the receiver amplifier, a constant bias current is consumed while the clock

is HIGH. Since the absolute time for which the clock is HIGH gets shorter as the signal rate

increases, the amplifier consumes less energy at higher frequency.

Fig. 4.4(b) shows energy and delay product versus VSWING. The delay from a

transmitter to a receiver is about 0.9nsec and its variations according to the VSWING or

signal rates are as small as ± 40psec. As shown in the figure, the optimal swing

voltage is 0.45V, 0.40V, and 0.30V at 400Mbps, 800Mbps and 1.6Gbps signal rates,

respectively. At each operating mode, the driving voltage scales to the optimal voltage

obtained above. Due to the low-swing signaling, the power dissipation on the global

link is reduced to 1/3 of that on a full-swing repeated link. In addition, there are no area-consuming

repeaters on the wires.

4.2 Mux-Tree based Round-Robin Scheduler

4.2.1 Switch Scheduler

To arbitrate the output conflicts, a scheduler is used on each output port. The

arbitration scheduling adds to the latency, power and area of the switch design. The

latency of the arbiter becomes larger than that of switch fabric as the switch size gets

bigger than 16x16 [44]. Furthermore, the area cost is not negligible when a

serialized link is used [11]. As shown in Fig. 4.5, the scheduler occupies an area comparable to that of the

switch fabric when the phit width of a port is 10 bits. Therefore, the scheduler design is

as important as the switch fabric design.

[Figure 4.5 layout labels: scheduler (70x80um2); 6x6 cross-point switch fabric (220x220um2).]

Figure 4.5: A 6x6 switch layout (phit width of a link: 10 bits).

The round-robin scheduling algorithm is the most widely used in on-chip networks [11]

because of its fairness and no-starvation properties. The round-robin scheduler can be

implemented by using two priority encoders [17] or Mux-Tree-connected logic [46]. A

Pseudo-LRU algorithm and its implementation were also proposed for lower area and

lower latency than those of the round-robin algorithm [47], [48].

4.2.2 Mux-Tree based Round-Robin Scheduler

[Figure 4.6 diagram labels. (a) Mux-Tree based implementation: REQ inputs, thermo-encoder, pointer, next to pointer, grant<0>, grant<1:0>, grant<2:0>. (b) Tiny Arbiter (TA): inputs U.req, U.token, L.req, L.token; outputs UorL (U=1, L=0), Q.req, Q.token. Truth-table input rows (Upper RQ TK, Lower RQ TK): 0X 0X, 1X 0X, 11 10, 10 11, 0X 1X, 11 11, 10 10.]

Fig. 4.6: Mux-Tree based Round-Robin Scheduler: (a) block diagram, (b) Tiny Arbiter.

A scheduler (or arbiter) is needed in a crossbar switch when two or more input

packets from different input ports are destined for the same output port at the same

time. Among a number of scheduling algorithms, a round-robin algorithm is most

widely used in ATM switches and on-chip networks due to its fairness and lightness

[16], [17]. There are many ways to implement the round-robin algorithm [17]-

[19], [16]. Fig. 4.6 shows a Mux-Tree based implementation whose structure is

highly modular and scalable. The scheduling latency is O(log n) and the required

resources are O(n), where n is the number of ports in a crossbar switch.

The round-robin scheduler has a rotating pointer that indicates the most recently granted

port. The port next to the pointer has the highest priority to be granted. For example, let the

request vector <7:0> = 01100010 with the pointer at port <5>.

Then, port <4> has the highest priority, and the lower group, port <4:0>, has higher

priority than the upper group, port <7:5>. This information is given by a thermo-

encoder whose output becomes token <7:0> = 00011111. Therefore, ports <4:0> hold

tokens. A request from a port holding a token ‘1’ has higher priority than a request from a port

without a token. These request and token vectors are the inputs of the binary Mux-Tree,

which has a Tiny Arbiter (TA) at each node. Each TA selects one of two ports,

the upper one or the lower one, based on the table shown in Fig. 4.6(b). When both requests

have tokens, the TA selects the upper port because the pointer rotates in decreasing order.

The UorL, Q.req and Q.token bits generated at each TA propagate to its parent

node as inputs. Then one of the two children’s UorL bits is selected by a 2:1 MUX based on the

parent’s UorL bit. The selected child UorL bit and the parent UorL bit are then

concatenated and propagate again. By doing so successively up to the root node, the

granted port number is finally determined.
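
The arbitration described above can be sketched in software as follows. This is a behavioural model under stated assumptions, not the circuit of Fig. 4.6: the function names are hypothetical, the TA decision rule is reconstructed from the prose (a request holding a token beats one without, and the upper port wins ties), and `reference_grant` is a plain linear scan added only to cross-check the tree.

```python
from typing import List, Optional, Tuple

# One tree node: (req, token, grant bits accumulated so far, MSB first).
Node = Tuple[int, int, List[int]]


def thermo_encode(pointer: int, n: int) -> List[int]:
    """Thermometer code: every port below the pointer holds a token."""
    return [1 if port < pointer else 0 for port in range(n)]


def tiny_arbiter(upper: Node, lower: Node) -> Node:
    """Select the upper or lower input (assumed rule, see lead-in)."""
    u_req, u_tok, u_bits = upper
    l_req, l_tok, l_bits = lower
    u_strong = u_req & u_tok          # upper requests and holds a token
    l_strong = l_req & l_tok          # lower requests and holds a token
    # Upper wins with a tokened request, or with any request when the
    # lower side has no tokened request.
    u_or_l = 1 if (u_strong or (u_req and not l_strong)) else 0
    q_req = u_req | l_req
    q_tok = u_strong | l_strong
    # Concatenate this node's UorL bit with the selected child's bits.
    bits = [u_or_l] + (u_bits if u_or_l else l_bits)
    return q_req, q_tok, bits


def mux_tree_grant(requests: List[int], pointer: int) -> Optional[int]:
    """Grant one port; requests[i] is port i, len(requests) is a power of two >= 2."""
    n = len(requests)
    tokens = thermo_encode(pointer, n)
    level: List[Node] = [(requests[i], tokens[i], []) for i in range(n)]
    while len(level) > 1:             # log2(n) levels of Tiny Arbiters
        level = [tiny_arbiter(level[i + 1], level[i])
                 for i in range(0, len(level), 2)]
    req, _, bits = level[0]
    return int("".join(map(str, bits)), 2) if req else None


def reference_grant(requests: List[int], pointer: int) -> Optional[int]:
    """Linear round-robin scan in decreasing port order, starting below the pointer."""
    n = len(requests)
    for step in range(1, n + 1):
        port = (pointer - step) % n
        if requests[port]:
            return port
    return None


# Example from the text: request <7:0> = 01100010, pointer at port 5.
example_requests = [0, 1, 0, 0, 0, 1, 1, 0]   # index 0 is port 0
print(mux_tree_grant(example_requests, pointer=5))
```

In this model port 1 is granted for the example, since it is the first requesting port below the pointer. The tree needs log2(n) arbiter levels and n-1 Tiny Arbiters, which matches the O(log n) latency and O(n) resource figures quoted earlier.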

The Mux-Tree based implementation is compared with four other designs such as

EXH, SHFT_ENC, RIPPLE, and DUAL_SPE presented in [17]. Fig. 4.7 shows

a comparison of power consumption and scheduling delay simulated with 8 input ports

in 0.18um process technology. This work, Mux-Tree, achieves the minimum power-delay

product: 136uW and 1.05nsec delay at 100MHz and an offered load of 50%.

This work also requires the fewest transistors, i.e. the smallest area, except for the RIPPLE

design, as shown in Table 4.1.

[Figure 4.7 plot. x-axis: delay [nsec], 0 to 4. Designs plotted: DUAL_SPE, SHFT_ENC, RIPPLE, and this work.]

Fig. 4.7: Power and delay comparison with other round-robin implementations [17].

TABLE 4.1 COMPARISON OF THE NUMBER OF REQUIRED TRANSISTORS

RIPPLE

SHFT_ENC

DUAL_SPE

This work

8 ports 16 ports

4.3 Crossbar Partial Activation Technique

4.3.1 Switch Fabric Design

Fig. 4.8: (a) Switch structure, (b) cross-point switch fabric, and (c) Mux-based switch fabric.

The conventional switch consists of Input Queue (IQ), scheduler, switch

fabric and Output Queue (OQ) as shown in Fig. 4.8(a). There are two kinds of switch

fabric design: a cross-point and Mux-based switch fabric as presented in Fig. 4.8(b)

and (c), respectively. The cross-point switch has pass-transistors at each crossing

junction of input and output wires. In this switch fabric, the capacitive loading driven

by input driver is junction capacitance of pass-transistors on input and output wires

and the wire capacitance itself. The voltage swing on the output wire is reduced to

VDD-Vth_N because of the threshold voltage drop of the NMOS pass-transistor, and thus the

power dissipation is reduced. However, this design is hard to synthesize. The

fabric area is determined by the wiring area and not by the transistors, so its area

cost can be minimal. The Mux-based switch uses a multiplexer for each output port.

The capacitive loading driven by the input driver is the input gate capacitance of the

multiplexers and input wire capacitance.

Table 4.2 Comparison of two designs of the switch fabric: power, delay and area.

Switch size   Power [mW]                Delay [psec]              Area [mm2]
              Cross-point   Mux-based   Cross-point   Mux-based   Cross-point   Mux-based
4x4           7.7           12.4        300           370         0.038         0.059
8x8           23.2          52.2        460           580         0.154         0.235
16x16         76.8          217.2       740           1000        0.614         0.941

Table 4.2 presents the power, speed and area comparison of the two different designs.

It is a simulation result using a capacitive wire model extracted from physical layout.

The power consumption of a cross-point switch is much lower than that of a Mux-

based design: 37%, 56% and 65% lower for 4x4, 8x8 and 16x16 switches, respectively.

Furthermore, the delay of a Mux-based switch is longer than that of a cross-point

switch because of the multiplexer gate-delay. The cross-point switch occupies about 65%

of the area of the Mux-based switch.
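
The percentages quoted above follow directly from Table 4.2. The short check below recomputes them from the tabulated values; the dictionary and function names are illustrative only.

```python
# Values transcribed from Table 4.2 as (cross-point, Mux-based) pairs.
power_mw = {"4x4": (7.7, 12.4), "8x8": (23.2, 52.2), "16x16": (76.8, 217.2)}
area_mm2 = {"4x4": (0.038, 0.059), "8x8": (0.154, 0.235), "16x16": (0.614, 0.941)}


def saving_percent(cross_point: float, mux_based: float) -> float:
    """Relative reduction of the cross-point design versus the Mux-based one."""
    return 100.0 * (1.0 - cross_point / mux_based)


power_saving = {size: saving_percent(cp, mux) for size, (cp, mux) in power_mw.items()}
area_ratio = {size: 100.0 * cp / mux for size, (cp, mux) in area_mm2.items()}

for size in power_mw:
    print(f"{size}: power saving {power_saving[size]:.1f}%, "
          f"area ratio {area_ratio[size]:.1f}%")
```

The recomputed power savings are close to 38%, 56% and 65%, so the 4x4 figure in the text appears to be truncated rather than rounded, and the area ratio stays near 65% for all three switch sizes.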

4.3.2 Crossbar Partial Activation Technique

[Figure 4.9 diagram labels. (a) Conventional crossbar. (b) Proposed crossbar with partial activation: main RB, sub RB, sub CB, scheduler x 8, input, output driver.]

Fig. 4.9: Schematic diagram of (a) an 8 x 8 conventional crossbar and (b) a proposed crossbar with partial activation technique.

A conventional X-Y based crossbar fabric is shown in Fig. 4.9(a). An n x n

crossbar fabric comprises n^2 crossing junctions, each of which contains an NMOS pass-transistor.

An NMOS-only pass-transistor is used rather than a CMOS transmission gate in order to

reduce the voltage swing to VDD-Vth and also to reduce gate loading. In the

conventional crossbar fabric, however, each input driver wastes power to charge and

discharge two long wires – Row-Bar (RB) and Column-Bar (CB) – and 2n transistor-

junction-capacitors. The RB and CB should be laid out in lower metal layers, M1 or

M2, in order to reduce the fabric area, and to minimize the resistance of via. Therefore

the loading on the driver output becomes significant as the number of ports increases.

Fig. 4.9(b) shows a proposed crossbar switch with Crossbar Partial Activation

Technique (CPAT). By splitting the n x n fabric into 4x4 fabrics (or tiles), the

activated capacitive loading is reduced by half. A gated input driver at each tile

activates its sub-RB only when its tile gets a grant from the scheduler. Only 4 four-input

OR-gates are additionally needed in each tile, regardless of the depth of a channel, k. The

output line, CB, is also divided into two sub-CBs to prevent signal propagation

into other tiles. A 2:1 MUX connects one of the two sub-CBs to the output port according

to the grant signals from its scheduler.

[Figure 4.10 plot. x-axis: offered load (10%, 30%, 50%, 70%, 90%). Series: conventional crossbar, crossbar with CPAT. Annotated values: 16, 19, 22.]

Fig. 4.10: Power comparison of an 8x8 crossbar fabric with - and without - crossbar partial activation technique.

An 8x8 crossbar fabric with CPAT is comparatively analyzed with a conventional

one. In this crossbar design, RBs and CBs are laid out in M2 and M1 layers,

respectively. The area of the fabric is about 240x240 um2. According to the

capacitance extraction from the layout, capacitance of a RB and a CB onto the

substrate are 44fF and 28fF, respectively and coupling capacitance between adjacent

bars is 13fF. Fig. 4.10 shows the comparative power consumption according to offered

load. As the offered load increases, the power reduction becomes more significant. At

90% offered load, 22% power saving is obtained. The CPAT cuts down the capacitive

loading on RBs and CBs by half. However, because the power consumption of the output

drivers and main RBs is not reduced, the power reduction does not exceed 25%. The

additional OR-gates and MUXs consume less than 2% of the overall power. When CPAT

is applied to a 16x16 crossbar switch which is divided into 4x4 tiles, a 43% power saving is

obtained.

4.4 Low-Energy Coding on On-chip Serial Link

There has been much research on bus coding for reducing the switching probability,

such as bus-invert (BI) coding [49], Gray code [50], T0 code [51], partial bus-invert

coding [52], probability-based mapping [53] and so on. However, there was also a

report on the ineffectiveness of those on-chip bus coding techniques, because the

power dissipation of the (de)coder is comparable to the power saving obtained by the

coding when the wire length is not longer than a few tens of mm [54]. Furthermore, those

bus coding schemes are effective on parallel buses but ineffective on the multiplexed

channels used in packet-switched networks.

As the link wires connecting processing units and switches are abundantly used,

wiring congestion will become one of the major challenges in network-on-chip

design. To alleviate the wiring congestion, a narrow channel [55] or an on-chip serialized

channel [9], [11] has been proposed. In serial communications, the wire frequency, f, is

multiplied by serialization ratio to support the same bandwidth as in parallel

communication but the number of wires, N, is divided by the serialization ratio. Thus,

the product of f and N is the same in serial and parallel communication channels and

the serialization ratio can be determined by trading off between f and N. However, the

switching activity factor of serial wire, α, is different from that of parallel wires and

the difference depends on the data patterns. Figure 4.11 shows an example for the

comparison of the number of transitions in parallel and serial communications. In this

example, the 8bit parallel bus has 7 transitions. However, when the same data stream is

serialized onto a single wire, the number of signal transitions on the wire increases

to 31. If there is correlation between adjacent data words, some bits of the parallel bus

stay calm without any transition. However, such correlation is not helpful in the serial

communication because data bits are multiplexed onto the single wire. Therefore, the

activity factor of the serial wire gets higher than that of parallel bus statistically. In

common multimedia applications, the most significant bits tend to have high spatial

and temporal correlations because of the sign extension or the locality characteristics

of multimedia streams [20]. In these applications, the serial communication dissipates

more energy than the parallel communication.
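
The transition counts of the Fig. 4.11 example can be reproduced with a few lines of code. This is a minimal sketch: the five data words are those of the example, bits are assumed to be serialized D7 first, and the function names are illustrative.

```python
from typing import List

# The five successive 8-bit data words of the example, written D7..D0.
WORDS: List[str] = ["01010001", "01010010", "01010011", "01010100", "01010101"]


def parallel_transitions(words: List[str]) -> int:
    """Sum of bit flips on every parallel wire between successive words."""
    return sum(a != b
               for prev, cur in zip(words, words[1:])
               for a, b in zip(prev, cur))


def serial_transitions(words: List[str]) -> int:
    """Bit flips on one wire when the words are sent back to back, D7 first."""
    stream = "".join(words)
    return sum(a != b for a, b in zip(stream, stream[1:]))


print(parallel_transitions(WORDS), serial_transitions(WORDS))
```

The upper bits never change between words, so they cost nothing on the parallel bus, but on the serial wire the same constant 0101 pattern is replayed with every word and dominates the transition count.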

[Figure 4.11 diagram. (a) 8bit parallel bus, D7 to D0: W0 = 01010001, W1 = 01010010, W2 = 01010011, W3 = 01010100, W4 = 01010101; 7 transitions. (b) Single bit serial bus carrying the same words from D7 to D0; 31 transitions.]

Fig. 4.11: An example of a number of transitions with the same data pattern on (a) parallel wires and (b) a serial wire.

Many parallel bus coding methods have been proposed to reduce the switching

power on the address or data bus between a processor and memories. However,

such conventional parallel bus coding methods cannot be employed in the serial bus.

Therefore, I proposed a serialized low-energy transmission (SILENT) coding

technique to minimize the transmission energy on the serial wire [21]. In this coding,

only transitions are encoded as the symbol ‘1’; that is, the difference between successive

data words is encoded.

The encoder works as follows:

B(t)[i] = b(t)[i] ⊕ b(t-1)[i] for i = 0 ~ n-1 (1)

b(t)[n-1:0]: n-bit data word from a sender at time t

B(t)[n-1:0]: n-bit encoded data word at time t

By serializing the encoded data words, the frequency of the appearance of zeros on

the wire increases because of the correlation between the successive data words, b(t).

Figure 4.12 shows an example for the advantage of this coding method. All bits from

B[7] to B[3] become zeros after these data words are encoded because those bits do

not change with time. Serializing these encoded words reduces the number of

transitions of the serial wire as shown in Figure 4.12(d) and the wire looks quiet or

silent. In this example, a conventional serial wire without the SILENT coding, shown

in Figure 4.12(c), has three times as many transitions from t+1 to t+4. By reducing the

number of transitions on the serial wire, the transmission energy can be saved

proportionally.
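
Equation (1) and the example of Figure 4.12 can be checked with a short software model. This is a sketch rather than the codec circuit: the function names are illustrative, and the reference word before the first transfer is assumed to be all zeros, which is consistent with the first encoded word in Fig. 4.12(b) being equal to the first data word.

```python
from typing import List


def silent_encode(words: List[int]) -> List[int]:
    """B(t) = b(t) XOR b(t-1), with b(-1) assumed to be 0."""
    encoded, prev = [], 0
    for word in words:
        encoded.append(word ^ prev)
        prev = word
    return encoded


def silent_decode(encoded: List[int]) -> List[int]:
    """b(t) = B(t) XOR b(t-1): the decoder accumulates the differences."""
    decoded, prev = [], 0
    for code in encoded:
        prev ^= code
        decoded.append(prev)
    return decoded


def serial_transitions(words: List[int], n: int = 8) -> int:
    """Bit flips on one wire when n-bit words are sent MSB first, back to back."""
    stream = "".join(format(word, f"0{n}b") for word in words)
    return sum(a != b for a, b in zip(stream, stream[1:]))


data = [0b01010001, 0b01010010, 0b01010011, 0b01010100, 0b01010101]
coded = silent_encode(data)

print([format(c, "08b") for c in coded])
print(serial_transitions(data), serial_transitions(coded))
```

For these five words the model gives 31 transitions without coding and 13 with coding, in agreement with the caption of Fig. 4.12. Decoding needs only the previously decoded word, so the receiver state is a single n-bit register.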

[Figure 4.12 diagram. (a) Data words from sender, b[7] to b[0], at t to t+4: 01010001, 01010010, 01010011, 01010100, 01010101; changed bits per step: 2/8, 1/8, 3/8, 1/8. (b) Encoded data words, B[7] to B[0], at t to t+4: 01010001, 00000011, 00000001, 00000111, 00000001. (c) Serial data without coding: the words of (a) serialized. (d) Serial data with coding: the words of (b) serialized.]

Fig. 4.12: (a) original data words, (b) encoded data words, (c) conventional serial data with 31 transitions and (d) encoded serial data with 13 transitions.

Figure 4.13 shows the circuit implementation of SILENT codec and the bold line

indicates the critical path in the circuits. It requires a single gate delay for encoding or

decoding and an additional MUX delay for enable or disable control. The power

consumption for 32bit data word encoding and decoding is about 390µW and 385µW,

respectively, at 100MHz frequency in the worst case data pattern.

[Figure 4.13 schematics: (a) encoder, (b) decoder I, (c) decoder II. Signal labels: E, b(t-1), d(t), d(t-1), D(t).]

Fig. 4.13: Circuits implementation of (a) encoder and (b, c) decoders.

[Figure 4.14 plot. x-axis: # of transitions b/w successive data words, 0 to 32. Series: w/o coding, w/ SILENT coding. Annotated regions: an energy saving range on each side of an overhead range.]

Fig. 4.14: Average power consumption on serial communications with and without SILENT coding.

In order to analyze the energy efficiency of this coding scheme, I evaluate the

energy consumption in the serial communication channel containing 32b-encoders,

32-to-4 serializer, 4bit 8mm serial wires with repeaters, 4-to-32 deserializer, and 32b-

decoders. The energy consumption in the communications depends on the data

patterns to be sent. So, I evaluate the power consumption with all possible variations

from a random data word. Figure 4.14 shows the comparison of the average power

consumption of the serial communication with and without SILENT coding at

100MHz operating frequency. The x-axis is the number of bits that differ

between successive 32bit data words, b(t). A value of 0 on the x-axis means that b(t) is the

same as b(t-1), and 16 means that an arbitrary 16 of the 32 bits of b(t) have

changed from their previous values in b(t-1). As a result, the region below 12 or above 21

on the x-axis is the energy saving region of the SILENT coding. However, there is

a power overhead of at most 14% for random data transitions in the region from 12 to

21. As shown here, the energy saving range is twice as wide as the overhead

range, and the amount of power saving is much larger than the overhead. Therefore,

the SILENT coding has ample opportunity to save energy for most data patterns.

To evaluate the performance of the proposed SILENT coding in a real application,

I trace the transactions of the on-chip traffic between a RISC processor and system

memories while a 3D Graphics application is running [22]. Full 3D Graphics pipelines

of geometry and rendering operations are executed for 3D scenes with 5878 triangles.

Figure 4.16 shows the distribution of the displacement of the memory address and data

for the successive memory accesses. The instruction memory address is so sequential

that 99.5% of the 6 million transactions are within the energy saving region of

SILENT coding. Although the instruction codes are quite random, 60% of them are within

the energy saving region. In the case of the data memory access, 79% and 70% of the

1.5 million data memory address and data transactions are within the energy saving

region, respectively. With this memory access pattern, I evaluated the energy

consumption for the serial communications. Fig. 4.15 shows the resulting normalized

average energy consumption on the serial wire with and without SILENT coding. The

energy consumption with SILENT coding includes the energy dissipation in the codec

circuits. The SILENT coding shows the best performance for instruction address,

about 77% energy saving. Even in the random traffic, in the case of the instruction

codes, 13% energy saving is achieved. It also saves 40 ~ 50% transmission energy for

multimedia data traffic. In conclusion, the SILENT coding reduces the energy

consumption of the serial communication in all kinds of on-chip data traffic in the 3D

Graphics application.

[Figure 4.15 bar chart. Legend: w/o coding, w/ SILENT coding. Categories: instruction address, instruction code, data memory address, data memory data. Annotated values: 0.51, 0.62.]

Fig. 4.15: Normalized average energy consumption in each memory access

[Figure 4.16 histograms for the 3D scene. x-axis of both panels: # of transitions b/w successive data words, 0 to 32. Panel (a): instruction address and instruction code, annotated "99% of address" and "60% of code". Panel (b): data memory address and data memory data, annotated "79% of address" and "70% of data".]

Fig. 4.16: Distribution of the displacement between successive (a) instruction and (b) data memory access.

Chapter 5 Implementation and Measurement Results

5.1 Design Flow and Methodology

The Network-on-Chip is designed by a semi-custom method. Integrated processors

are synthesized, memories are compiled by an SRAM compiler, and on-chip networks are

full-custom designed for low power and high performance. Processors and

memories are obtained from vendors and reused by attaching the network interface

and wrappers. Fig. 5.1 shows the design flow: the EDA tools used, the process stages, and the

output at each stage. It took 6 months from architecture sketch to tape-out with a

team of 1 architect and circuit designer, 1 protocol designer and 4 layout artists.

One additional package-board designer provided support after the chip fabrication.

[Figure 5.1 flow chart. Process stages: protocol definition, architecture definition, protocol verification, architecture verification, logic design, circuit design, memory design, synthesis, circuit simulation, SRAM compile, manual layout, place and route, assemble and PnR, DRC and LVS, tape-out and fabrication (Dongbu-Anam 0.18um CMOS), test board/package design, demonstration. EDA tools used: MS-Visio, C language, C-based simulator, Verilog-HDL, Verilog-XL, Design Compiler, SRAM compiler (Dongbu), EPIC, HSpice, Synopsys Astro, Cadence OPUS (Virtuoso), Mentor Calibre, Cadence OrCAD, measurement equipment and FPGA. Outputs: protocol spec. sheet, C model, Verilog model, behavioral netlist, gate-level netlist, block layout, full-chip layout, GDS-II, chip die, test system, visual demo.]

Fig. 5.1: Chip and System Design Flow.

5.2 Chip Implementation and Measurements

Fig. 5.2: Block diagram of a prototype SoC for multimedia applications.

With the proposed NoC architecture, protocol and low-power techniques, I

implemented a multimedia SoC as a prototype. The block diagram is shown in Fig. 5.2.

The chip integrates two clusters; a main cluster and a peripheral cluster. The main

cluster contains two RISC processors, on-chip FPGA, two 64kb SRAM, and an off-

chip gateway. Two RISC processors emulate multiprocessor systems. Off-chip

gateway (OGW) [9] enables seamless off-chip communications with other NoCs on

the same package or boards in order to compose larger scale systems. By using the

OGW, a PU on a die can communicate with PUs on other dies without protocol

conversion. The peripheral cluster contains three memories to emulate peripheral slave

units. I assumed that the peripheral cluster is located far from the main cluster to

emulate a large SoC; thus the two clusters are interconnected via a 5mm global link. The

global link uses low-swing signaling to reduce power consumption and also

differential signaling for higher SNR. PLL generates 100MHz clock for main cluster

PUs, 50MHz clock for peripheral cluster units, and 1.6GHz network clock for

switches and network interfaces. The clock frequencies are scalable for power

management modes, i.e. 100/50/1600MHz for FAST mode, 50/25/800MHz for

NORMAL mode, and 25/12.5/400MHz for SLOW mode. PU clocks are not

synchronized with each other, in order to emulate systems with multiple timing references.

Therefore no effort is needed for clock skew minimization. The on-chip network

supports 3.2GByte/s communication bandwidth for each PU and 11.2GB/s aggregate

bandwidth at FAST mode. The chip is implemented using 0.18µm CMOS process with

6-Al metal layers and its die area takes 5x5mm2. Fig. 5.3 shows the die photograph.

The on-chip network dissipates 51mW at FAST mode with full traffic

condition. Fig. 5.4 shows the power break-down and the effectiveness of the proposed

techniques. By using the low-power techniques such as low-swing signaling, crossbar-

partial activation, and serial-link coding, the overall power consumption is reduced by

38%. The power efficiency index, power consumption per bandwidth, is 4.6mW/GB/s

in this work, which is only one ninth of that of the previous work [9], 41mW/GB/s.
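
The index follows from the two figures quoted in this section, the 51mW network power and the 11.2GB/s aggregate bandwidth at FAST mode. The symbols below are introduced only for this check.

```latex
\frac{P_{\mathrm{NoC}}}{BW_{\mathrm{aggregate}}}
  = \frac{51\ \mathrm{mW}}{11.2\ \mathrm{GB/s}}
  \approx 4.55\ \mathrm{mW/(GB/s)},
\qquad
\frac{41\ \mathrm{mW/(GB/s)}}{4.55\ \mathrm{mW/(GB/s)}} \approx 9.0
```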

Fig. 5.3: Die photograph.

Fig. 5.4: On-chip network power reduction by the proposed low-power techniques.

The implemented chip is successfully measured. Fig. 5.5 shows the measured

packet signals on the network at FAST mode, and Fig. 5.6 shows the measured packets

with and without SiLENT coding while a 3D graphics application is running on the

system [22]. The effectiveness of the coding is clearly shown where there are fewer

transitions on a channel.

Fig. 5.5: Measured packet signal on the network.

Fig. 5.6: Measured packet signals (a) without and (b) with SiLENT coding.

[Figure 5.5 annotations: Strobe, Packet<0>, EOP, 1nsec time scale, bit sequence 1 0 0 0 0 0 1 0 1 0, header, address. Measured by Tektronix TDS7708 (w/ 40Gbps sampling frequency).]

[Figure 5.6 annotations: Strobe, Pkt [0] to Pkt [7]. (a) Without SiLENT coding (134 transitions). (b) With SiLENT coding (79 transitions).]

5.3 Network-in-Package

Four NoCs are mounted on a single 676-BGA package as shown in Fig. 5.7(c), (d).

The Network-in-Package (NiP) needs four isolated supply voltages: 1.8V for digital

logic, 1.8V for analog circuits, 3.3V for I/O, and sub-0.5V for low-swing links. The

operating frequencies are different for each module: 50MHz for peripheral logic,

100MHz for processors, 800MHz for scheduler, and 1.6GHz for on-chip networks.

Fig. 5.7: NiP simulation (TLM, SSN) and NiP photograph.

The important issue for the package design is the power integrity, i.e. a design of

power and ground (P/G) network. No significant resonance should occur on the P/G

plane of the package at the operating frequency. Otherwise, small noise from signals or

an external system causes P/G noise so significant that the P/G network becomes unstable. In

order to analyze the power integrity, I used Transmission Line Matrix (TLM) Method

and Simultaneous Switching Noise (SSN) analysis. (See Fig. 5.7(a), (b)) I integrated

decoupling-capacitors for each power voltage at the proper position: 5 for logic power,

2 for I/O power, 1 for analog power. Each capacitor has 10nF capacitance and 600pH

equivalent series inductance (ESL). Fig. 5.7(a) shows the self-impedance of the

power plane. I targeted the self-impedance of the power plane as 1Ω. The solid-line is

for bare P/G plane and the dotted line is for the proposed decouple-capacitors-insertion.

The impedance of the bare plane shows inductance-characteristics and exceeds the

target impedance at 800MHz and 1.6GHz. After the decoupling-capacitance-insertion,

the impedance resonance occurs at 272.1MHz that comes from the L of bare plane and

C of the inserted-capacitors. As a result, the self-impedance at the target frequency

becomes lower than 1 Ω. Fig. 5.7(b) shows the SSN analysis results when 1.6GHz

input signals are applied. The switching noise is reduced from 34mV to 2mV due to

the decoupling-capacitor-insertion. I also inserted ground lines between high frequency

signal lines to eliminate crosstalk.
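
As a side calculation, the series self-resonant frequency of a single decoupling capacitor can be estimated from the two values given above with the standard LC resonance formula. This is a sketch only: it treats each capacitor as an ideal series LC pair, the function name is illustrative, and it does not reproduce the 272.1MHz plane resonance, which depends on the inductance of the bare plane that is not stated here.

```python
import math


def self_resonant_frequency_hz(capacitance_f: float, esl_h: float) -> float:
    """Series LC resonance: f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(esl_h * capacitance_f))


# Values from the text: 10nF capacitance and 600pH series inductance.
f_srf = self_resonant_frequency_hz(capacitance_f=10e-9, esl_h=600e-12)
print(f"{f_srf / 1e6:.1f} MHz")
```

Under this idealization each capacitor resonates at roughly 65MHz, well below the 800MHz and 1.6GHz operating frequencies.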

5.4 Demonstration System

I developed a demonstration system with the implemented Networks-in-Package for

multimedia applications. Fig. 5.9 shows the demonstration system, which consists of

an NiP board on top, a video board on the bottom layer, and an LCD module. The system

shows a JPEG-decoded image on the display, processed by the embedded processors

through networks on the chip and networks in the package. High-speed packets up

to 1.6GHz are measured via an on-board probing pad near the package. Fig. 5.9(d)

shows packet transactions between two NoCs on the NiP where the two NoCs are

running at different clock frequencies, e.g. 400MHz and 274MHz.

Fig. 5.8: Demonstration-system for single-chip package.

[Figure 5.9 panels: (a) evaluation board (mother board), with labels NiP, display, instruction ROM, and video card at the bottom layer; (b) high-speed probing pad; (c) images on the display: chip layout, designers, and 3D scene; (d) packet and clock signals on the Network-in-Package.]

Fig. 5.9: Demonstration-system for Network-in-Package.

Chapter 6 Conclusions

A low-power packet-switched Network-on-Chip (NoC) with hierarchical star

topology is designed and implemented for high-performance SoC design. According

to a performance and cost oriented topology analysis, the hierarchical star topology

shows the best cost-efficiency and also the lowest latency.

The chip contains two RISC processors for multiprocessor emulation, two 64kb

SRAM, on-chip FPGA, off-chip gateway for off-chip network interface, three 4kb

SRAM for peripheral logic emulation, 1.6GHz PLL for internal clock generation, and

on-chip networks connecting those processing units. On-chip network channel is

serialized from 80bits onto 8bits to reduce the network area significantly. Source-

synchronous signaling enables plesiochronous communications between processing

units running at different clock frequencies. Low-power consumption is achieved by

applying various techniques such as low-swing signaling on a global link, Mux-Tree

based round-robin scheduler in a router, crossbar partial activation, low-energy serial-

link coding, and clock frequency scaling. The chip integrates 2.5 million transistors

and consumes less than 160mW and the on-chip network consumes less than 51mW

delivering 11.2GB/s aggregated network bandwidth. The 5x5mm2 chip is fabricated

with 0.18µm CMOS process and the four fabricated chips are integrated in a single

BGA package to organize Networks-in-Package (NiP) for large scalable systems with

low-cost. The implemented NoC and NiP are successfully measured and demonstrated

on a system evaluation board running multimedia applications.

BONE(TM) 2.0 Specification

Semiconductor System Lab., Dept. of EE, KAIST

Se-Joong Lee, shocktop@eeinfo.kaist.ac.kr

Build Number: 009

Contents

BONE Specification

Chapter 1 Introduction to the BONE
1.1 Overview of the BONE specification
1.2 Overall Architecture of the BONE
1.3 Features
1.4 Terminology

Chapter 2 BONE signals
2.1 Master Network Interface (MNI)
2.2 Up_Sampler (UPS)
2.3 Switch (SW)
2.4 Dn_Sampler (DNS)
2.5 Slave Network Interface (SNI)

Chapter 3 BONE data-plane protocol
3.1 Packet formation
3.2 Packet transaction-level protocol
3.3 Control signals
3.4 NI Timing diagram (basic packet transfer)
3.5 NI Timing diagram (burst packet transfer)
3.6 NI Timing diagram (acknowledge, flow-control)
3.7 UPS/DNS Timing diagram
3.8 SW Timing diagram

Chapter 4 BONE control-plane protocol (Under construction)

Chapter 5 BONE peripherals (Under construction)

Chapter 1

Introduction to the BONE

1.1 Overview of the BONE

The Basic On-chip Network (BONE) specification defines an on-chip communication architecture and its protocol standards for designing high-performance, application-specific, very-large-scale systems-on-chip (SoCs). The BONE, which is built on a network architecture, interconnects integrated functional units or intellectual properties (IPs) and provides sufficient bandwidth and minimum latency without requiring global synchronization.

1.2 Overall Architecture

The BONE consists of five kinds of components: Master Network Interface (MNI), Slave Network Interface (SNI), Up_Sampler (UPS), Dn_Sampler (DNS), and Switch (SW). The MNI connects a master to the BONE. The UPS and DNS serialize and deserialize packets, respectively. The non-blocking SW routes packets. A path from an MNI to an SNI is called the forward_path, and the reverse path is called the backward_path.

Figure 1.1.1 Overall architecture (block diagram showing the Master, UPS, SW, and DNS along the Forward_path and Backward_path)

1.3 Features

1.3.1 Physical Layer

* 32b ADDR, 32b DATA, and 8b SIG

* 7b routing information (RI) for packet routing

* fCLK_MASTER < fCLK_BONE

* fCLK_SLAVE < fCLK_BONE

* No synchronization between different clocks

1.3.2 Datalink Layer

* Serialization / Deserialization

* 2-level flow control

1.3.3 Network Layer

* 7b source routing

* Control Packet for managing the network

* Cut-through switching

1.3.4 Transport Layer

* Encoding / Decoding scheme to reduce packet size

* Acknowledge Packet for end-to-end hand-shaking

* Compact Packet for burst read/write operation

* Direct Signal (SIG) field for direct-mapped signal transfer

* Four burst lengths are available: 1, 2, 4, and 8

1.4 Terminology

The following terms are used throughout this specification.

phase_1, phase_2    The front and back half periods of a clock cycle, respectively.

forward_packet    A packet transferred through the forward_path.

backward_packet    A packet transferred through the backward_path.

burst_period    The clock cycles in which burst packets are transmitted.

burst_start_period    The clock cycle at which the burst_period starts.

burst_ext_period    The clock cycles in which the burst operation continues after the burst_start_period.

burst_read    A read command that expects burst read-out data.

burst_write    A series of write operations whose address space is contiguous.

Figure 1.4.1 Terminologies about burst operation (waveform of FD<31:0> carrying D0 to D3, annotated with phase_1, phase_2, burst_start_period, burst_ext_period, and burst_period)

Chapter 2

BONE signals

2.1 Master Network Interface (MNI)

Figure 2.1 MNI interface diagram (block diagram of the MNI ports; the signals are described in Table 2.1)

WTMSREQ Wait Master Request.

WTMS Wait Master. Bypasses WTMSREQ. Controls packet flow from a master.

WTDNSREQ Wait Down_Sampler Request.

WTDNS Wait Down_Sampler. Bypasses WTDNSREQ. Controls packet flow from the DNS.

FWRITE Forward path Write. HIGH for a write command, LOW for a read command.

FBL<1:0> Forward path Burst Length (00: 1, 01: 2, 10: 4, 11: 8)

FPRT Forward path Priority (1: high priority, 0: low priority)

FSEN Forward path Direct Signal Enable.

FS<7:0> Forward path Direct Signals

FAEN Forward path Address Enable

FA<31:0> Forward path Address

FDEN Forward path Data Enable

FD<31:0> Forward path Data

FACKREQ Forward path Acknowledge Request. An NI that receives a packet with ACKREQ must reply with an ACK packet.

FHOEN Forward path Header Output Enable.

FHO<23:0> Forward path Header Output

FAOEN Forward path Address Output Enable

FAO<31:0> Forward path Address Output

FDOEN Forward path Data Output Enable

FDO<31:0> Forward path Data Output

BHEN Backward path Header Enable

BH<23:0> Backward path Header

BAEN Backward path Address Enable

BA<31:0> Backward path Address

BDEN Backward path Data Enable

BD<31:0> Backward path Data

BACKn Backward path Acknowledge Not. Set to ‘H’ when ACKREQ is asserted. Set to ‘L’ when an ACK packet arrives.

BSOEN Backward path Direct Signal Output Enable

BSO<7:0> Backward path Direct Signal Output

BDOEN Backward path Data Output Enable

BDO<31:0> Backward path Data Output

Table 2.1 MNI signals


2.2 Up_Sampler (UPS)

Figure 2.2 UPS interface diagram (block diagram of the UPS ports; the signals are described in Table 2.2)

WTNIREQ Wait Network Interface Request.

WTNI Wait Network Interface. Retimes WTNIREQ.

SRCHEN Source Header Enable. The source may be either an MNI or an SNI.

SRCH<23:0> Source Header

SRCAEN Source Address Enable.

SRCA<31:0> Source Address

SRCDEN Source Data Enable

SRCD<31:0> Source Data

DATASTB Data strobe. The destination retimes packets at its rising edge.

EOP End-of-packet.

LINK<7:0> Link through which packets are delivered.

Table 2.2 UPS signals


2.3 Switch (SW)

Figure 2.3 SW interface diagram (block diagram of the SW ports; the signals are described in Table 2.3)

WTSWREQx Wait Switch Request. The switch must slow down or even stop packet transfer. ‘x’ indicates the port number.

WTUPSx Wait UPS. The switch buffer is close to overflow.

DATASTBJx Datastb Input. Switched without retiming.

EOPJx End-of-packet Input. Switched without retiming.

LINKJx<7:0> Link Input.

DATASTBOx Datastb Output

EOPOx End-of-packet Output

LINKOx<7:0> Link Output

Table 2.3 SW signals


2.4 Dn_Sampler (DNS)

Figure 2.4 DNS interface diagram (block diagram of the DNS ports; the signals are described in Table 2.4)

WTSWREQ Wait Switch Request

WTSW Wait Switch. Retiming WTSWREQ.

DSTHOEN Destination Header Output Enable.

DSTHO<23:0> Destination Header Output.

DSTAOEN Destination Address Output Enable.

DSTAO<31:0> Destination Address Output

DSTDOEN Destination Data Output Enable

DSTDO<31:0> Destination Data Output

Table 2.4 DNS signals


2.5 Slave Network Interface (SNI)

Figure 2.5 SNI interface diagram (block diagram of the SNI ports; the signals are described in Table 2.5)

WTDNSREQ Wait DNS Request. A slave can apply backpressure with this signal.

WTDNS Wait DNS. Bypasses WTDNSREQ.

WTSLREQ Wait Slave Request. A switch controls packet flow from a slave with this signal.

WTSL Wait Slave. Bypasses WTSLREQ.

FHEN Forward path Header Enable

FH<23:0> Forward path Header.

FAEN Forward path Address Enable

FA<31:0> Forward path Address

FDEN Forward path Data Enable

FD<31:0> Forward path Data

FWRITE Forward path Write. ‘H’ for a write operation, ‘L’ for a read operation.

FBLO<1:0> Forward path Burst Length Output

FTOEN Forward path Tag Output Enable

FTO<15:0> Forward path Tag Output. This tag information is used for the backward_packet from slave to master.

FSOEN Forward path Direct Signal Output Enable

FSO<7:0> Forward path Direct Signal Output

FAOEN Forward path Address Output Enable

FAO<31:0> Forward path Address Output

FDOEN Forward path Data Output Enable

FDO<31:0> Forward path Data Output

BSEN Backward path Direct Signal Enable

BDEN Backward path Data Enable

BS<7:0> Backward path Direct Signals.

BD<31:0> Backward path Data.

BHOEN Backward path Header Output Enable

BHO<23:0> Backward path Header Output

BDOEN Backward path Data Output Enable

BDO<31:0> Backward path Data Output

Table 2.5 SNI signals


Chapter 3

BONE data-plane protocol

3.1 Packet formation

3.1.1 Packet format

Figure 3.1.1 Packet formats

Normal Packet: HEADER<23:0> (containing TAG<15:0> with the fields RI, C=0, P, W, SAD, AC, and BL), ADDR<31:0>, DATA<31:0>

Acknowledge Packet: TAG with RI, C=0, P, W=1, SAD=000, AC=0, BL=00

Compact Packet: C_HEADER<7:0> with RI and C=1

C Compact. ‘1’ for compact packet, ‘0’ for others.

RI Route Information.

P Priority. ‘1’ for high priority

W Write

SAD 3b bitmap encoding. Each bit indicates whether the corresponding field exists. Ex) 101 means that the Direct Signal and Data fields exist.

AC Acknowledge Request.

BL Burst Length

HC Hop Count. Every switch performs a shift-right. A Control Packet is switched according to RI until HC becomes 0.

Table 3.1.1 TAG field description

HEADER includes the TAG and Direct Signal (SIG) fields. In a Compact Packet, the header is only 1 byte and is called C_HEADER. The header of a Control Packet contains a Hop Count (HC) field instead of the SIG field. An Acknowledge Packet consists of a TAG.
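As a rough illustration of the TAG fields in Table 3.1.1, the following Python sketch packs and unpacks a 16-bit TAG. The field widths (RI 7b, C 1b, P 1b, W 1b, SAD 3b, AC 1b, BL 2b) add up to the 16 bits of TAG<15:0>, but the bit positions chosen here (RI in the most significant bits, BL in the least significant) are an assumption made for illustration only. The names `Tag`, `pack_tag`, `unpack_tag`, `sad_fields`, and `burst_length` are hypothetical and do not appear in the specification. The burst-length decoding follows the FBL<1:0> encoding (00: 1, 01: 2, 10: 4, 11: 8).

```python
from dataclasses import dataclass

# Assumed layout, MSB to LSB: RI[15:9] C[8] P[7] W[6] SAD[5:3] AC[2] BL[1:0].
# The widths come from the specification; the positions are illustrative only.


@dataclass(frozen=True)
class Tag:
    ri: int   # 7b route information
    c: int    # 1 = Compact Packet
    p: int    # 1 = high priority
    w: int    # 1 = write
    sad: int  # 3b bitmap: Direct Signal, Address, Data
    ac: int   # acknowledge request
    bl: int   # 2b burst-length code


def pack_tag(t: Tag) -> int:
    """Pack the fields into a 16-bit integer using the assumed layout."""
    return ((t.ri & 0x7F) << 9 | (t.c & 1) << 8 | (t.p & 1) << 7 |
            (t.w & 1) << 6 | (t.sad & 0x7) << 3 | (t.ac & 1) << 2 |
            (t.bl & 0x3))


def unpack_tag(v: int) -> Tag:
    """Inverse of pack_tag."""
    return Tag(ri=(v >> 9) & 0x7F, c=(v >> 8) & 1, p=(v >> 7) & 1,
               w=(v >> 6) & 1, sad=(v >> 3) & 0x7, ac=(v >> 2) & 1,
               bl=v & 0x3)


def sad_fields(sad: int) -> list[str]:
    """List the fields present; S is the leftmost bit and D the rightmost."""
    names = ["SIG", "ADDR", "DATA"]
    return [n for i, n in enumerate(names) if sad & (0b100 >> i)]


def burst_length(bl: int) -> int:
    """Decode the 2b burst-length code: 00 -> 1, 01 -> 2, 10 -> 4, 11 -> 8."""
    return 1 << (bl & 0x3)
```

For example, `sad_fields(0b101)` lists the Direct Signal and Data fields, matching the "101" example in the table.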

3.1.2 Route information field (RI)

The 7b RI contains route information from a master to a slave. The RI field is divided into a few sub-fields, each of which corresponds to the output port at one switch hop. RI modification, which deletes the head index and attaches the flipped source port index, is required to record the route path. The RI for the reverse path can be obtained by a bit-wise flip operation. RI modification and flipping for the reverse-path RI are depicted in Figure 3.1.2.

Figure 3.1.2 RI modification and flipping

RI for Master -> Slave (indexes for the 1st, 2nd, and 3rd hops): J11 J10 | J22 J21 J20 | J31 J30

RI modification (SW performs): at each hop the head index is deleted and the source port index (J') is attached, giving J'11 J'10 J'22 J'21 J'20 J'31 J'30 after the 3rd hop

RI flipping (SNI performs), RI for Slave -> Master: J'31 J'30 J'22 J'21 J'20 J'11 J'10
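The RI handling described above can be sketched in Python under several assumptions that the specification text does not spell out: the sub-field widths are taken as 2, 3, and 2 bits (as the J11 J10, J22 J21 J20, J31 J30 grouping suggests), "flipped" is read as bit-order reversal, and the source port index is assumed to have the same width as the output port index at that switch. The function names are hypothetical. With these assumptions, reversing the whole 7-bit RI at the SNI yields sub-fields in reverse hop order, each naming the port through which the forward packet entered the switch.

```python
RI_BITS = 7


def reverse_bits(value: int, width: int) -> int:
    """Reverse the bit order of a width-bit value."""
    out = 0
    for _ in range(width):
        out = (out << 1) | (value & 1)
        value >>= 1
    return out


def switch_modify(ri: int, width: int, src_port: int) -> tuple[int, int]:
    """One switch hop: read the head index, delete it, and attach the
    bit-reversed source port index at the tail.
    Returns (output port, modified RI)."""
    out_port = ri >> (RI_BITS - width)
    shifted = (ri << width) & ((1 << RI_BITS) - 1)
    return out_port, shifted | reverse_bits(src_port, width)


def flip_ri(ri: int) -> int:
    """Bit-wise flip performed at the network interface (assumed reversal)."""
    return reverse_bits(ri, RI_BITS)


def walk(ri: int, widths: list[int], src_ports: list[int]):
    """Route a packet through a chain of switches.
    Returns (list of output ports taken, RI seen at the far end)."""
    outs = []
    for width, src in zip(widths, src_ports):
        out, ri = switch_modify(ri, width, src)
        outs.append(out)
    return outs, ri


# Forward path: output ports 3, 5, 0; packet enters the switches on ports 1, 6, 2.
fwd_outs, ri_at_sni = walk(0b1110100, [2, 3, 2], [1, 6, 2])
# Backward path: traverse the same switches in reverse order.
back_outs, ri_at_mni = walk(flip_ri(ri_at_sni), [2, 3, 2], [0, 5, 3])
```

In this sketch the backward packet leaves each switch through the port on which the forward packet arrived (`back_outs` is `[2, 6, 1]`), and flipping the RI once more at the master side recovers the original forward RI.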


3.2 Packet transaction-level protocol

Figure 3.2.1 Basic read packet transaction
Master -> Slave: TAG (RI, C=0, P, W=0, S, A=1, D=0, AC=0, BL=00)
Slave -> Master: TAG (RI^-1, C=0, P, W=0, S, A=0, D=1, AC=0, BL=00), DATA

Figure 3.2.2 Basic write packet transaction
Master -> Slave: TAG (RI, C=0, P, W=1, S, A=1, D=1, AC=0, BL=00), ADDR, DATA

Figure 3.2.3 Low priority burst read packet transaction
Master -> Slave: TAG (RI, C=0, P=0, W=0, S, A=1, D=0, AC=0, BL)
Slave -> Master: TAG (RI^-1, C=0, P=0, W=0, S, A=0, D=1, AC=0, BL), DATA, followed by Compact Packets, each consisting of CH (RI^-1, C=1) and DATA

Figure 3.2.4 Low priority burst write packet transaction
Master -> Slave: ADDR, DATA, ..., followed by Compact Packets, each consisting of CH and DATA

Figure 3.2.5 High priority burst read packet transaction
Slave -> Master: TAG (RI^-1, C=0, P=1, W=0, S, A=0, D=1, AC=0, BL), DATA, ..., followed by further TAG, DATA pairs

Figure 3.2.6 High priority burst write packet transaction
Master -> Slave: TAG (RI, C=0, P=1, W=1, S, A=1, D=1, AC=0, BL), ADDR, DATA, ..., followed by further TAG, DATA pairs

Figure 3.2.7 Low/high priority, single/burst write w/ ackreq packet transaction
Master -> Slave: TAG (RI, C=0, P, W=1, S, A=1, D=1, AC=1, BL), ADDR, DATA
Slave -> Master: Acknowledge Packet (RI^-1, C=0, P, W=1, SAD=000, AC=0, BL=00)
Note: In the case of a burst write, only the first packet can request acknowledgement.

Figure 3.2.8 Direct-signal packet transaction
Master/Slave: TAG (RI, C=0, P, W=1, S=1, A=0, D=0, AC=0, BL=00)

3.3 Control signals

3.3.1 Serialization

The UPS serializes packets to transfer them through the 8b LINK. To indicate the packet length, an EOP (end-of-packet) signal is transmitted together with the packet, as shown in Figure 3.7.1.

3.3.2 Flow control

Masters, switches, and slaves can generate ‘WT???’ signals to control the flow rate. The MNI, UPS, DNS, and SNI only bypass or retime the flow-control signals.

3.4 NI Timing diagram (basic packet transfer)

3.4.1 Basic read packet transaction

Figure 3.4.1 (Master) MNI (UPS) protocol for basic read (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO, BHEN, BDEN, BH, BD, BSO, BDO)

Outputs of a master are latched outputs and are settled in phase_1. Forward_path latency is zero [MNI_01].

Before it outputs BSO<7:0> and BDO<31:0>, the MNI examines whether a packet is destined for the MNI itself (i.e., a Control Packet) or for the master. If the packet is destined for the MNI, BSOEN and BDOEN are disabled [MNI_02].

Backward_path latency is zero [MNI_03].

Figure 3.4.2 (DNS) SNI (Slave) protocol for basic read (timing diagram; signals: FHEN, FAEN, FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO, BS, BD, BT, BHOEN, BDOEN, BHO, BDO)

The SNI does not retime forward_path inputs and therefore transfers them as soon as possible. In other words, forward_path latency is zero [SNI_01].

The SNI examines whether a packet is a Control Packet. If it is, FSOEN, FAOEN, and FDOEN are disabled, and the SNI interprets the Control Packet [SNI_02].

For a read operation, the SNI modifies FH to generate FTO, which is used for the relevant backward_packet. The C, P, and BL fields are not changed. RI is flipped. W is set to ‘H’. SAD is set appropriately. AC is set to ‘L’. The FTO must be delayed by the amount of the slave latency so that it is output through BT together with BD [SNI_03].
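The field rules of [SNI_03] can be sketched as a small Python function. This is only an illustration under stated assumptions: `TagFields`, `flip7`, and `make_fto` are hypothetical names, "flipped" is read as reversing the bit order of the 7-bit RI, and "properly set" SAD is assumed to mean a data-only reply (SAD = 001), as the basic read transaction suggests. The delay of FTO by the slave latency is not modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TagFields:
    ri: int   # 7b route information
    c: int    # compact flag
    p: int    # priority
    w: int    # write flag
    sad: int  # 3b bitmap: Direct Signal, Address, Data
    ac: int   # acknowledge request
    bl: int   # 2b burst-length code


def flip7(ri: int) -> int:
    """Reverse the bit order of a 7-bit RI (assumed meaning of 'flipped')."""
    return int(format(ri & 0x7F, "07b")[::-1], 2)


def make_fto(fh: TagFields) -> TagFields:
    """Derive the backward-packet tag for a read request.
    C, P, and BL are kept; RI is flipped; W = 1; AC = 0.
    SAD = 0b001 (data only) is an assumption, not stated in the text."""
    return replace(fh, ri=flip7(fh.ri), w=1, sad=0b001, ac=0)
```

A read request tag with `ri=0b1001101` would, under these assumptions, produce a reply tag with `ri=0b1011001` and unchanged C, P, and BL.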

Backward_path inputs may be settled in phase_2; thus, backward_path latency is 1 [SNI_04].

3.4.2 Basic write packet transaction

Figure 3.4.3 (Master) MNI (UPS) protocol for basic write (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO)

Figure 3.4.4 (DNS) SNI (Slave) protocol for basic write (timing diagram; signals: FHEN, FAEN, FDEN, FH, FA, FD, FWRITE, FBLO, FSO, FAOEN, FDOEN, FAO, FDO, FTO)

FTOEN is disabled for write operation.


3.5 NI Timing diagram (burst packet transfer)

3.5.1 Burst read packet transaction (low priority)

Figure 3.5.1 (Master) MNI (UPS) protocol for burst read w/ low priority (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO, BHEN, BDEN, BH, BD, BSO, BDO)

All signals are valid during burst_ext_period [MNI_04].

BSOEN is disabled during burst_ext_period [MNI_05].

Figure 3.5.2 (DNS) SNI (Slave) protocol for burst read w/ low priority (timing diagram; signals: FHEN, FAEN, FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO, BS, BD, BT, BHOEN, BDOEN, BHO, BDO)

The SNI supports the burst_read operation. BHO outputs C_HEADER if the priority of BT0 is ‘L’ [SNI_05].

3.5.2 Burst write packet transaction (low priority)

Figure 3.5.3 (Master) MNI (UPS) protocol for burst write w/ low priority (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO)

(FH0, FA0, FD0) constitutes a Normal Packet. (FH1, FD1), (FH1, FD2), and (FH1, FD3) constitute Compact Packets when FPRT is ‘L’ [MNI_06].

FSEN and FAEN are disabled during burst_ext_period [MNI_07].
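To give a feel for what the Compact Packet saves on the 8b link, the following Python sketch counts link bytes for a low-priority burst write. It assumes that the first beat is a full Normal Packet (3-byte HEADER, 4-byte ADDR, 4-byte DATA) and that each later beat is a Compact Packet (1-byte C_HEADER plus 4-byte DATA). The comparison case, in which every beat is sent as a full Normal Packet, is a hypothetical baseline and not a mode defined by the specification; the function names are likewise illustrative.

```python
HEADER_BYTES = 3    # HEADER<23:0>
C_HEADER_BYTES = 1  # C_HEADER<7:0>
ADDR_BYTES = 4      # ADDR<31:0>
DATA_BYTES = 4      # DATA<31:0>


def low_priority_burst_write_bytes(burst_length: int) -> int:
    """Link bytes for one Normal Packet followed by Compact Packets."""
    if burst_length not in (1, 2, 4, 8):
        raise ValueError("burst length must be 1, 2, 4, or 8")
    normal = HEADER_BYTES + ADDR_BYTES + DATA_BYTES
    compact = C_HEADER_BYTES + DATA_BYTES
    return normal + (burst_length - 1) * compact


def all_normal_bytes(burst_length: int) -> int:
    """Hypothetical baseline: every beat sent as a full Normal Packet."""
    return burst_length * (HEADER_BYTES + ADDR_BYTES + DATA_BYTES)
```

Under these assumptions a burst of four beats takes 11 + 3 x 5 = 26 link bytes, compared with 44 bytes for the hypothetical all-Normal baseline.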

Figure 3.5.4 (DNS) SNI (Slave) protocol for burst write w/ low priority (timing diagram; signals: FHEN, FAEN, FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO)

FTOEN is disabled [SNI_06]


3.5.3 Burst read packet transaction (high priority)

Figure 3.5.5 (Master) MNI (UPS) protocol for burst read w/ high priority (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO, BHEN, BDEN, BH, BD, BSO, BDO)

BSOEN may be asserted during burst_ext_period to receive

multiple Direct Signals [MNI_08].

(BH, BD) constitutes a Normal Packet [MNI_09].


Figure 3.5.6 (DNS) SNI (Slave) protocol for burst read w/ high priority (timing diagram; signals: FHEN, FAEN, FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO, BS, BD, BT, BHOEN, BDOEN, BHO, BDO)

BSEN may be asserted for multiple cycles to transfer multiple

Direct Signals [SNI_07].


3.5.4 Burst write packet transaction (high priority)

Figure 3.5.7 (Master) MNI (UPS) protocol for burst write w/ high priority (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO)

FSEN may be asserted during burst_ext_period to transfer multiple

Direct Signals [MNI_10].

(FHO, FAO, FDO) constitutes a Normal Packet [MNI_11].

Figure 3.5.8 (DNS) SNI (Slave) protocol for burst write w/ high priority (timing diagram; signals: FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO)

FHEN is asserted for burst_ext_period [SNI_08].


3.6 NI Timing diagram (acknowledge, flow-control)

3.6.1 Acknowledge packet transaction

Figure 3.6.1 Acknowledge request by MNI (timing diagram; signals: FACKREQ, BH)

BACKn must be asserted if FACKREQ is enabled. It is deasserted when BH includes an Acknowledge Packet [MNI_12].

Figure 3.6.2 Acknowledge reply by SNI (timing diagram; signals: FH, BHO, BDO)

If the acknowledge request field of FH0 is enabled, the SNI replies with an Acknowledge Packet in the next clock cycle [SNI_09].
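The acknowledge handshake in [MNI_12] and [SNI_09] can be pictured with a toy Python model. It is not cycle-accurate and assumes at most one outstanding acknowledge request. The class `BacknTracker` and the helper `is_ack_tag` are hypothetical names, and the Acknowledge Packet test uses the field values shown for that packet type in the packet formats (C=0, W=1, SAD=000, AC=0, BL=00).

```python
def is_ack_tag(c: int, w: int, sad: int, ac: int, bl: int) -> bool:
    """True if the tag fields match the Acknowledge Packet pattern."""
    return (c, w, sad, ac, bl) == (0, 1, 0b000, 0, 0b00)


class BacknTracker:
    """Toy model of the MNI BACKn output: 1 ('H') while an ACK is pending."""

    def __init__(self) -> None:
        self.backn = 0

    def send(self, ackreq: bool) -> None:
        # Forward packet leaves the MNI; FACKREQ raises BACKn.
        if ackreq:
            self.backn = 1

    def receive(self, is_ack_packet: bool) -> None:
        # Backward packet arrives on BH; an Acknowledge Packet clears BACKn.
        if is_ack_packet:
            self.backn = 0
```

A master model would call `send(True)` when it issues a write with FACKREQ set and could then poll `backn` until the acknowledge returns.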


3.6.2 Flow control

Figure 3.6.3 Flow-controlled burst write at MNI (timing diagram; signals: FPRT, FWRITE, FBL, FS, FA, FD, FHO, FAO, FDO)

Figure 3.6.4 Flow-controlled burst read at MNI (timing diagram; signals: BHEN, BDEN, BH, BD, BSO, BDO)

Burst forward_packet of an MNI is flow-controlled by disabling

FDEN and FSEN [MNI_13].

Burst backward_packet of an MNI is flow-controlled by disabling

BHEN and BDEN [MNI_14].

Figure 3.6.5 Flow-controlled burst write at SNI (timing diagram; signals: FH, FA, FD, FWRITE, FBLO, FSO, FAO, FDO, FTO)

Figure 3.6.6 Flow-controlled burst read at SNI (timing diagram; signals: BS, BD, BT, BHOEN, BDOEN, BHO, BDO)

Burst forward_packet of an SNI is flow-controlled by disabling

FHEN and FDEN [SNI_10].

Burst backward_packet of an SNI is flow-controlled by disabling

BSEN and BDEN [SNI_11].


3.7 UPS/DNS Timing diagram

When a UPS serializes a packet, the fields are serialized in the order of Header, Address, and Data. The operation of the UPS and DNS is independent of the packet transfer type, such as read/write, basic/burst, and acknowledgement. The EOP signal is used to indicate the end of the packet.
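The serialization order can be illustrated with a short Python sketch that turns a packet into a sequence of 8-bit link words with an EOP flag. The field order (Header, Address, Data) and the field widths (24b, 32b, 32b) come from the specification. The byte order within each field (most significant byte first), the choice to raise EOP on the last link word, and the function name `serialize` are assumptions made for illustration.

```python
def serialize(header=None, addr=None, data=None):
    """Return a list of (link_byte, eop) pairs in Header, Address, Data order.
    A field passed as None is treated as not enabled and is skipped."""
    fields = [(header, 3), (addr, 4), (data, 4)]  # (value, size in bytes)
    link = []
    for value, nbytes in fields:
        if value is None:
            continue
        # Assumed byte order: most significant byte first.
        link.extend(value.to_bytes(nbytes, "big"))
    # Assumed EOP timing: asserted together with the last link word.
    return [(byte, i == len(link) - 1) for i, byte in enumerate(link)]


words = serialize(header=0xABCDEF, addr=0x11223344, data=0x55667788)
```

A full Normal Packet yields eleven link words (three header, four address, four data), which matches the H H H A A A A D D D D sequence over L0 to L10 in the UPS timing diagram.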

Figure 3.7.1 (MNI) UPS (SW) (timing diagram with three cases (1), (2), and (3); signals: CLK_UPS, SRCHEN, SRCH, SRCAEN, SRCA, SRCDEN, SRCD, DATASTB, LINE; link words L0 to L10 carry H H H A A A A D D D D)

The minimum delay time, tUPSd, which the UPS requires to start serialization, is left open to the actual design. This specification does not fix the value.

Figure 3.7.2 (SW) DNS (SNI) (timing diagram; signals: CLK_DNS, DATASTB, LINE, DSTHOEN, DSTHO, DSTAOEN, DSTAO, DSTDOEN, DSTDO; link words L0 to L10)

The minimum delay time, tDNSd, which the DNS requires to finish de-serialization, is left open to the actual design. This specification does not fix the value.

3.8 SW Timing diagram

Figure 3.8.1 Switch delay timing diagram (signals: CLK_SW, DATASTBJx, LINEJx, DATASTBOx, LINEOx)

The switching delay time, tSWd, is open for actual design. This

specification does not fix the value.

Figure 3.8.2 Switch timing for flow-control (signals: CLK_SW, DATASTBJx, LINEJx, DATASTBOx, LINEOx, WTSWREQx, WTUPSx; timing parameters: tWT_SW_R, tWT_SW_F, tWT_UPS_R, tWT_UPS_F)

The minimum setup time to request a wait of a switch, tWT_SW_R, is 2 clock cycles [SW_01].

The minimum delay time from de-assertion of WTSWREQ to LINEO, tWT_SW_F, is 1 clock cycle [SW_02].

The minimum setup time of WTUPS to request a wait of a UPS before the end of the previous packet is left open to the actual design. It may depend on the propagation delay and response time of the UPS.

The minimum delay time from de-assertion of WTUPS to LINEJ is 1 clock cycle [SW_03].

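Summary in Korean (English translation)

A packet-switched Network-on-Chip (NoC) has been designed for low power for use in high-performance SoCs and fabricated in silicon. This work covers the overall NoC design method, from the NoC architecture decision to the system demonstration.

First, to decide the topology, a comparative analysis of performance, power, and area was carried out. The comparison covered not only basic topologies such as bus, mesh, star, and point-to-point, but also hierarchical and heterogeneous topologies composed of these basic topologies.

Second, regarding the NoC architecture and its components, the switching method, packet synchronization, link serialization, protocol, and buffering method were analyzed.

The fabricated chip integrates two RISC processors for multiprocessor emulation, two 64 kbit SRAMs, an on-chip FPGA, an off-chip gateway for connection to the off-chip network, three 4 kbit SRAMs for emulation of peripheral logic, a 1.6 GHz PLL, and an on-chip network as the means of communication among them. The channels of this on-chip network are serialized from 80 bits to 8 bits to drastically reduce on-chip area and complexity. In addition, source-synchronous signaling is used for plesiochronous communication among the on-chip units operating at different clock frequencies. This dissertation proposes and applies the following low-power techniques for on-chip networks: low-swing signaling on the channels, a Mux-Tree based round-robin scheduler, crossbar partial activation, low-power channel coding for serial links, and operating-frequency scaling. The chip consumes at most 160 mW, and the proposed on-chip network consumes 51 mW while providing 11.2 GB/s of communication bandwidth. The chip, fabricated in a 0.18 µm CMOS process with an area of 25 mm², has been verified to operate correctly and demonstrates multimedia applications. Finally, a Network-in-Package (NiP) technique, which integrates four of the fabricated chips in a single package so that larger systems can be built, is proposed, fabricated, and measured.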


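Acknowledgements

Although only my name appears on the cover of this dissertation, many people helped me greatly during the past six years of graduate study that led to my Ph.D., and I would like to express my gratitude to all of them. First of all, I sincerely thank Professor Hoi-Jun Yoo (유회준), who, over the not-so-short period of two and a half years of my master's course and three and a half years of my doctoral course, showed by his own example unwavering passion and what it means to be a proper researcher, and who never spared sharp and insightful advice. I am also deeply grateful to Professors 박규호, 김정호, 신영수, and 이혁재, who carefully read my imperfect dissertation and gave me advice despite their busy schedules.

I would also like to express my deep thanks to the SSL Family, who have always thought hard about the past, present, and future while striving to be the best laboratory over the past six years. To my seniors who have gone out into the world and are pursuing their own high dreams (세정이형, 용하형, 치원이형, 주호형, 진호형, 선호형, 람찬이형, 세중이형, 정훈이형, 재원이형), and to my classmates and juniors (재서, 성은, 진경, 민욱): every experience and conversation we shared, in whatever form, was a positive and constructive stimulus to me. I also thank Professor 박성민, Dr. 이재열, and Principal Researcher 서지선, who shared useful advice and experience with me, if only for a short time. I send my deep thanks as well to all the current lab members who are putting their heads together and vigorously pushing SSL forward (병규형, (손)주호, 성대, 성준, 교민이형, 정호, 동현, 선영, 남준, 관호, 혜정, 담이, 주영). I also thank 홍은수, who helped admirably with the tedious administrative work, and 김가원, who put in a great deal of effort with me on the fabrication of the Network-in-Package.

Finally, I dedicate this dissertation to my mother, who gives me great love and always worries about me, and to my father, who watches over me from heaven.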

SEE YOU AT THE TOP!


KANGMIN LEE kangmin@eeinfo.kaist.ac.kr

http://ssl.kaist.ac.kr/~kangmin

EDUCATION

Korea Advanced Institute of Science and Technology (KAIST) - Full Scholarship from Korea Government

9/02 － 2/06 Ph.D. in Electrical Engineering Dissertation: Design and Implementation of Low-Power Network-on-Chip for Application to High-Performance System-on-Chip Design

3/00 － 8/02 M.S. in Electrical Engineering Dissertation: Design and Implementation of a 80Gbps Shared Bus Packet Switch using Embedded DRAM

3/96 － 2/00 B.S. in Electrical Engineering － Magna Cum Laude Overall GPA: 3.84/4.30 － Major GPA: 3.80/4.3

University of California Berkeley, CA, US 6/00 － 8/00 Visiting student in Computer Science WORK EXPERIENCE

Korea Advanced Institute of Science and Technology (KAIST) 3/00 － Present Research Assistant － Perform research mainly focusing on various

circuits, architectures, protocols, algorithms and systems design and chip implementation. Major research area includes switches and on-chip interconnection networks.

3/00 － Present Teaching Assistant － Assist teaching for Electronic Laboratories, ASIC and Computer architecture courses

Micrel Lab. (Prof. Luca Benini), University of Bologna, Italy

4/05 － 5/05 Research Assistant － Perform a cross-benchmarking of AMBA Multilayer Bus and Xpipes Network-on-Chip in aspects of latency, area, power consumption.

SAMSUNG Electronics, Ki-Heong, Korea

1/99 － 2/99 Winter Intern － Intern in Liquid Crystal Display division R&D team.

Research Assistant － Designed and implemented an OSD (On Screen Display) Controller chip for TV by using Mentor CAD System.


RESEARCH PROJECTS

BONE (Basic On-Chip Network) Development of On-Chip Interconnection Packet Switch Network for a SoC

5/04 － Present Leading and managing an On-Chip Network Team

2/03 － 3/04 Responsible for full chip architecture and design of a multimedia application SoC with low-power on-chip networks [ISSCC2005]

7/02 － 12/02 Responsible for high-speed and light-weight crossbar scheduling algorithm and its implementation

HOB (Hierarchical Output Buffer) Development of 10Gb-Ethernet 8x8 shared-bus switch fabric with Embedded DRAM

6/01 － 6/02 Responsible for full chip architecture, design and layout including memory and logic

RAMP (RAM Processor) Development of Application Specific Embedded Memory Logic Design Technology

2/00 － 6/00 SRAM Cell Layout, Calibre DRC/LVS Rule File Creation

POPeye (Probe Of Performance) Development of a Simulator for a DRAM Architecture Performance Evaluation on variable System Configuration with Real-life Applications for Windows Platform.

8/00 － 12/00 Responsible for Performance Evaluation and Analysis of a DDR-SDRAM, a Direct-Rambus DRAM, and DDR-Fast-Cycle-RAM.

INTERNATIONAL JOURNAL PAPERS (2 FIRST AUTHORED)

TVLSI 2006

Low-Power Network-on-Chip for High-Performance SoC Design Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. IEEE Transactions on VLSI Systems (Accepted for publication)

D&T Magazine 2005

Analysis and Implementation of Practical Cost-Effective Network-on-chips Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Design & Test Computers Magazine, Sept-Oct. 2005

TCAS-II 2005

Packet-Switched On-Chip Interconnection Network for System-on-Chip Applications Se-Joong Lee, Kangmin Lee, Seong-Jun Song and Hoi-Jun Yoo IEEE Transactions on Circuits and Systems II, Vol. 52, No. 6, June 2005


JSSC 2002

A Reconfigurable Multilevel Parallel Texture Cache Memory with 75-GB/s Parallel Cache Replacement Bandwidth Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Young Jung and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits Vol. 37, No. 5, May 2002

JSSC 2001

An 80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits Vol. 36, No. 11, November 2001

Journals 2001

POPeye: A Simulator for a DRAM Performance Evaluation Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, and Hoi-Jun Yoo IEEK Journal of Semiconductor Technology and Science, Vol. 1, No. 2, June 2001

INTERNATIONAL CONFERENCE PAPERS (9 FIRST AUTHORED)

A-SSCC 2005 Outstanding Design Award

Networks-on-Chip and Networks-in-Package for High-Performance SoC Platforms Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, and Hoi-Jun Yoo. IEEE Asian Solid-State Circuits Conference (Outstanding Design Award) 2005

ISCAS 2005

An Arbitration Look-Ahead Scheme for Reducing End-to-End Latency in Networks-on-Chip Kwanho Kim, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo. IEEE International Symposium on Circuits and Systems 2005

ISCAS 2005

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on-Chip Donghyun Kim, Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. IEEE International Symposium on Circuits and Systems 2005

ICCAD 2004

SILENT: Serialized Low Energy Transmission Coding for On-Chip Interconnection Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International Conference on Computer Aided Design 2004

SOCC 2004

Low Energy Transmission Coding for On-Chip Serial Communications Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International SOC Conference 2004

SOCC 2004

On-Chip Network Based Embedded Core Testing Jong-Sun Kim, Min-Su Hwang, Seungsu Roh, Ja-Young Lee, Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International SOC Conference 2004


ISSCC 2004

A 51mW 1.6GHz On-Chip Network for Low-Power Heterogeneous SoC Platform

Kangmin Lee, Se-Joong Lee, Sung-Eun Kim, Hye-Mi Choi, Donghyun Kim, Sunyoung Kim, Min-Wuk Lee and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2004

CICC 2003

A Distributed On-Chip Crossbar Switch Scheduler for On-Chip Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo

IEEE Custom Integrated Circuits Conference 2003

ESSCIRC 2003

A High-Speed and Lightweight On-Chip Crossbar Switch Scheduler for On-Chip Interconnection Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE European Solid State Circuits Conference 2003

ESSCIRC 2003

A 10Gbps/port 8x8 Shared Bus Switch with embedded DRAM Hierarchical Output Buffer Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE European Solid State Circuits Conference 2003

ISSCC 2003

An 800MHz Star-Connected On-Chip Network for Application to Systems on a Chip Se-Joong Lee, Seong-Jun Song, Kangmin Lee, Jeong-Ho Woo, Sung-Eun Kim, Byeong-Gyu Nam, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2003

GLOBECOM 2002

A Practical Method to use eDRAM in the Shared Bus Switch Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE Global Telecommunications Conference 2002

ISCAS 2001

A Comparative Analysis of a DDR-SDRAM, a D-RDRAM and a DDR-FCRAM using a POPeye Simulator Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Ja-Il Ku, and Hoi-Jun Yoo IEEE International Symposium on Circuits and Systems 2001

120mW Embedded 3D Graphics Rendering Engine with 64Mb Logically Local Frame Buffer and 3.2GByte/s Run-time Reconfigurable Bus for PDA-chip Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Yong-Ha Park, and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001

Low Power Motion Compensation Block IP with embedded DRAM Macro for Portable Multimedia Applications Chi-Weon Yoon, Jeonghoon Kook, Ramchan Woo, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001


A Reconfigurable Multilevel Parallel Graphics Cache Memory with 75-GB/s Parallel Cache Replacement Bandwidth Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Young Jung and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001

ISSCC 2001

80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications

Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2001

DOMESTIC PAPERS

Conferences 2001

POPeye: A Simulator for a DRAM Performance Evaluation Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, and Hoi-Jun Yoo Korea Conference on Semiconductors 2001

PATENTS

1. Low Power Crossbar Switch Fabric Kangmin Lee and Hoi-Jun Yoo Korea Patent 10-2004-17745 (pending)

2. Serial Data Transmitter-Receiver And Method Thereof Kangmin Lee and Hoi-Jun Yoo Korea Patent 10-2004-31840 (pending)

AWARDS

1. Outstanding Award at A-SSCC Student Design Contest 2005 Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, Hoi-Jun Yoo Networks-on-Chip and Networks-in-Package for High-Performance SoC Platforms

2. The Silver Prize at 5th National IC Design Contest 2004 Kangmin Lee, Se-Joong Lee, Donghyun Kim


Design and Implementation of Multimedia SoC using High-Performance On-Chip Network

3. The Best Design Award from Korea Prime-Minister at 3rd National IC Design Contest 2002

Kangmin Lee, Jaeseo Lee Design and Implementation of an 80Gbps Shared-bus Switch with eDRAM

PROFESSIONAL ACTIVITIES

1. Member of Technical Program Committees: DATE 2006 (http://www.date-conference.com/)

2. International Invited Seminars

(1) Dagstuhl Seminar: Power-Aware Computing Systems (http://www.dagstuhl.de/)

(2) IMEC Regular Seminar: Network-on-Chip and Network-in-Package (http://www.imec.be)

RESEARCH INTEREST

1. High-Speed and Low-Power On-Chip Interconnection Network Architecture and its Silicon Design for SoC Platform

2. Gigabit Network Switch Design with Embedded DRAM Technology

SKILLFUL TOOLS

High-level Simulation: C/C++, SystemC

Logic Design: Verilog HDL, Synopsys Design Compiler, Astro P&R Tools

Circuit Design: Cadence Opus, Hspice, Synopsys nanosim

Layout Art: Cadence Opus, SKILL, Calibre

Workstation: UNIX (Solaris OS)

On-chip Interconnection Protocols: BONE, AMBA (AHB and AXI), IBM CoreConnect, OCP-IP

LANGUAGE

Korean as a mother tongue, Proficient English, Beginning Japanese, Chinese and Italian
