Design and implementation of
Low-Power Network-on-Chip for Application to
High-Performance System-on-Chip Design
ADVISOR: Professor Yoo, Hoi-Jun
By Kangmin Lee
Department of Electrical Engineering and Computer Science
Division of Electrical Engineering
Korea Advanced Institute of Science and Technology
A THESIS SUBMITTED TO THE FACULTY OF THE KOREA
ADVANCED INSTITUTE OF SCIENCE AND TECHNOLOGY IN
PARTIAL FULFILLMENT OF REQUIREMENTS OF THE DEGREE OF
DOCTOR OF PHILOSOPHY IN THE DEPARTMENT OF ELECTRICAL
ENGINEERING AND COMPUTER SCIENCE, DIVISION OF
ELECTRICAL ENGINEERING.
DAEJEON, KOREA
2005. 12. 11
APPROVED BY
__________________
Professor Yoo, Hoi-Jun
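Design and Implementation of a Low-Power Network-on-Chip
for High-Performance Systems-on-Chip
Lee, Kangmin
This thesis has been examined and approved by the thesis committee
as a doctoral dissertation of the Korea Advanced Institute of Science and Technology.
December 2, 2005
Committee Chair: Yoo, Hoi-Jun (유회준) (seal)
Committee Member: 박규호 (seal)
Committee Member: 김정호 (seal)
Committee Member: 신영수 (seal)
Committee Member: 이혁재 (seal)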
DEE
20025851
Lee, Kangmin (이강민). Design and implementation of Low-Power Network-on-Chip for Application to High-Performance System-on-Chip Design. Department of Electrical Engineering and Computer Science, Division of Electrical Engineering. 2005. 152 p. Advisor: Professor Yoo, Hoi-Jun. Text in English.
Abstract
A low-power packet-switched Network-on-Chip (NoC) is designed with
hierarchical star topology and implemented in real silicon for possible application to
high-performance SoCs. This dissertation presents how to obtain low power
consumption in NoC while the whole NoC design process is covered from the
architecture decision to the system demonstration.
First, a performance and cost oriented topology exploration is performed. The
evaluated topologies include not only flat topologies such as a bus, mesh, star and
point-to-point but also sixteen hierarchical and heterogeneous topologies. The
evaluation method uses technology-independent analytical models with
implementation-based physical parameters.
Second, the detailed network architecture, including the switching method, packet
synchronization, link serialization, protocol and buffering schemes, is analyzed
with special emphasis on low power consumption. The implemented chip contains
two RISC processors for multiprocessor emulation, two 64kb SRAMs, an on-chip
FPGA, an off-chip gateway for interfacing to an outer network, three 4kb SRAMs for
peripheral logic emulation, a 1.6GHz PLL for internal clock generation, and on-chip
networks connecting those processing units. The on-chip network channel is serialized
from 80 bits onto 8 bits to reduce the area and complexity of the network.
Source-synchronous signaling enables plesiochronous communication between
processing units running at different clock frequencies. Low power consumption is
achieved by adopting various techniques such as low-swing signaling on a global
link, a Mux-Tree based round-robin scheduler in a router, crossbar partial activation,
low-energy serial-link coding and clock frequency scaling. The chip consumes less
than 160mW and the on-chip network consumes less than 51mW while delivering
11.2GB/s aggregated network bandwidth. The power consumption per bandwidth is
a ninth of that of the previous study. The 5x5mm2 chip is fabricated in a 0.18µm CMOS
process, and multimedia applications are successfully demonstrated on a system
evaluation board. Multiple NoCs are integrated in a single BGA package to organize
Networks-in-Package (NiP) for large scalable systems at low cost.
Contents
Abstract
1. Introduction
2. Topology Exploration
    2.1 Methodology Description
    2.2 Energy Exploration
    2.3 Area Exploration
    2.4 Performance Exploration
    2.5 Discussion
    2.6 Model Verification with Case-Study Examples
    2.7 Summary
3. Network Architecture
    3.1 Circuit Switching and Packet Switching
    3.2 Synchronization
    3.3 Serialization
    3.4 NoC Protocol
4. Low-Power Techniques
    4.1 Low-swing signaling
    4.2 Mux-Tree based Round-Robin Scheduler
    4.3 Crossbar Partial Activation Technique
    4.4 Low-Energy Coding on On-chip Serial Link
5. Implementation & Measurements
    5.1 Design Flow and Methodology
    5.2 Chip Implementation and Measurements
    5.3 Network-in-Package
    5.4 Demonstration System
6. Conclusions
Appendix: BONE-2 Protocol Specification
Summary
Bibliography
Acknowledgement
Chapter 1 Introduction
In the nanoelectronics era, System-on-Chip (SoC) design has many opportunities and
many difficulties as well. More than a billion transistors are expected to be integrated on
a single chip in this decade, together with numerous Intellectual Property (IP) blocks and
multiprocessors. According to the International Technology Roadmap for
Semiconductors, before the end of the decade, 50nm CMOS at 10GHz clock speeds
will be readily available [1]. With a chip area of about 400mm2, which is equal to the area
of the 64b-Itanium processor, over a thousand microprocessor cores (or modules of
comparable complexity) may be integrated onto a single chip. For the implementation
of these large-scale SoCs, one of the most challenging tasks is the ever-increasing
power consumption. Another challenge is the complicated interconnection between
integrated devices.
[Figure: heterogeneous IPs (uP, DSP, graphics engine, FPGA, memories, peripheral IPs, PMU) with individual clock sources, connected through network interfaces, routers (switches, SW) and links]
Fig. 1.1 Heterogeneous Network-on-Chip architecture
Wire delays have become critical as compared to gate delays, causing
synchronization problems among IPs. This trend even worsens as the clock
frequencies increase and the feature sizes decrease [1]. Moreover the interconnections
including clock wires spreading over the whole chip area are readily influenced by
process uncertainty or physical disturbance of nanotechnology. The performance of
SoCs will depend on the capability to efficiently interconnect the multiple predefined
and pre-verified IPs in accommodation with their communication requirements [2].
Furthermore, communication among IPs consumes a significant portion of the overall
system power budget.
Recently, Network-on-Chip (NoC) architectures have been emerging as a scalable,
reliable, and highly modular on-chip communication infrastructure for SoC design [3].
The NoC architecture uses layered protocols and packet-switched networks which
consist of on-chip routers (or switches), links, and network interfaces on a predefined
topology, as depicted in Fig. 1.1. Thanks to this modular structure, the NoC methodology
aims to design SoCs in a Plug-and-Play fashion akin to the conventional computer
Internet. Instead of interconnecting chip modules at the top level using
ad-hoc routing of dedicated global wires, as is done today, a better approach is to
interconnect them by a structured on-chip packet-switching network that routes
packets between them. The advantages of the NoC approach include both
modularity and performance benefits. There have been many architectural and
theoretical studies on NoCs, such as design methodology [3], [4], topology exploration
[2], [5], QoS guarantee [6], reliable transmission [7], software issues [8] and
test and verification [23]. However, only a few were implemented and verified in silicon,
and moreover they were not energy-efficient [9], [10]. For large-scale NoC
implementations, the power consumption of the network infrastructure should be
minimized to achieve reliable transmission at low cost.
Conventional low-power techniques such as dynamic
voltage/frequency scaling, power supply gating with sleep transistors, clock gating
and bus coding have been devised. The objective of this thesis is to extend such low-power
design methodologies to the NoC design field and to verify them in NoC design
environments.
In this study [11], I designed and implemented a hierarchically star-connected NoC
with various low-power techniques. The chip contains heterogeneous IPs such as two
RISC processors, multiple memory arrays, an FPGA, off-chip network interfaces, and
a PLL. The integrated on-chip network provides 11.2GB/s aggregate bandwidth and
consumes 51mW under full traffic conditions. In contrast, the previous work [9]
consumes 264mW with 6.4GB/s bandwidth. The ratio of power consumption to
provided bandwidth in this study is about one ninth of that of the previous work.
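The power-per-bandwidth comparison can be checked with a few lines of arithmetic. The sketch below uses only the four figures quoted in this paragraph; the function and variable names are illustrative and do not come from the thesis.

```python
def power_per_bandwidth(power_mw: float, bandwidth_gbytes_per_s: float) -> float:
    """Return network power per unit bandwidth in mW per GB/s."""
    return power_mw / bandwidth_gbytes_per_s


this_work = power_per_bandwidth(51.0, 11.2)   # this study
previous = power_per_bandwidth(264.0, 6.4)    # previous work [9]
improvement = previous / this_work            # how many times lower this work is

print(f"this work: {this_work:.2f} mW/(GB/s)")
print(f"previous:  {previous:.2f} mW/(GB/s)")
print(f"ratio:     {improvement:.1f}x")
```

The ratio comes out close to nine, in line with the "a ninth" figure quoted in the abstract.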
Large-scale SoCs with huge chip sizes, such as embedded memory logic systems,
often suffer from low yield and high cost. To cope with these problems,
System-in-Package (SiP or System-on-Package) techniques have emerged. However,
current SiP technology focuses on the fabrication process rather than the
methodology for interconnecting the partitioned chips. NoCs are so
modular that a large system can be divided into several parts or several chips
to mitigate such problems. In this thesis, four NoC chips are mounted in a single package
for larger system emulation, forming a Networks-in-Package (NiP) with the seamless NoC
protocol, a new family of SiP.
The organization of this thesis is as follows. Multiple basic and hierarchical
topologies are evaluated in terms of performance and cost in Chapter 2. The NoC
architecture is discussed in Chapter 3, and the low-power techniques are
presented in Chapter 4. The implementation, measurement results and
demonstration of the NoC and NiP follow in Chapter 5, and finally, the
conclusions of this work are summarized in Chapter 6.
Chapter 2 Topology Exploration
Recently, state-of-the-art chips have been integrating multiple processing cores to scale
up their performance rather than accelerating a single core with a faster clock [25].
This trend can be observed not only in high-end servers and desktop computers but
also in diverse embedded applications such as entertainment devices and mobile
terminals [26-27]. As the trend accelerates, the on-chip
communication infrastructure between cores requires higher bandwidth and a more
scalable architecture than a conventional shared-bus structure can offer. To cope with
these requirements, the concept of Network-on-Chip (NoC) was proposed [28] and
implemented [11] as a packet-switching interconnection network with a scalable
topology. The NoC is well known to provide sufficient bandwidth and throughput by
using non-blocking switching fabrics and packet-multiplexing channels. However, the
power and area cost of the on-chip network have not been clearly examined in
comparison with those of the conventional bus architecture. Moreover, the performance,
power and area cost are strongly dependent on the network topology [29]. Therefore it
is crucial to choose the optimal topology which meets the performance requirements
and the energy and area budget.
There have been several studies on topology exploration for NoCs. Murali et al.
developed a tool for automatically selecting an application-specific topology that
minimizes average communication delay, area and power dissipation [5]. Wang et al.
presented a technology-aware topology exploration of various meshes/tori [29].
Kreutz et al. presented a topology evaluation engine based on a heuristic optimization
algorithm [30].
In these prior works, the candidate pool of topologies was limited to regular
and homogeneous topologies like a mesh, torus, cube, tree or multistage network.
Moreover, a comparison with the conventional bus architecture, which is a practical
concern of field engineers, was not sufficiently studied. In heterogeneous
SoCs like embedded or mobile systems, however, the communication flows are
localized, not uniformly distributed [2-3]. In this case, it is highly possible
that the optimal topology is a heterogeneous and hierarchical topology rather than
a homogeneous and flat topology. For example, George et al. proposed a hybrid
interconnecting structure incorporating a local point-to-point, semi-global mesh and
global tree network for a low-power FPGA application [31], which led to the Pleiades chip
implementation [56]. Therefore we need to investigate such hierarchical and
heterogeneous topologies in more detail.
In this study, I present an analytic methodology to predict the performance, energy
and area cost of various topologies – not only basic topologies including bus, mesh,
star and point-to-point topologies but also hierarchical and hybrid topologies, for
example, a hierarchical bus, local-star global-mesh or local-bus global-star topology.
In this analysis, analytical models are proposed and physical parameters depending on
the process technologies and circuit designs are used. This work reveals the detailed
relationship of the performance/energy/area characteristics with the number of
integrated cores and various traffic patterns.
This chapter is organized as follows. Section 2.1 presents the candidate topologies to be
explored and the exploration method, including the performance and cost models and a
traffic model. In Sections 2.2, 2.3 and 2.4, the energy, area and performance properties of the
topologies are examined, respectively. In Section 2.5, the most performance- and
cost-efficient topology is discussed. In Section 2.6, the results from the proposed
models are compared with real implementation results for the validation of the
proposed models. Finally, the chapter concludes with Section 2.7.
2.1 Methodology Description
2.1.1 Topology Pool
Topologies are categorized into two groups in this analysis: flat topologies and
hierarchical topologies. Fig. 2.1 shows flat topologies such as a bus, star, mesh, and
point-to-point and also hierarchical topologies, for example local-bus global-star,
local-star global-mesh and local-star global-star. The hierarchical topologies consist
of a local and global network topology where the local and global network can have
any type of the basic topologies. I comparatively analyze the four flat topologies and
the sixteen hierarchical topologies.
Fig. 2.1: Topology pool: (a) ~ (d) basic flat topologies and (e) ~ (g) hierarchical topologies as examples.
2.1.2 Assumptions
I assume that the size of each processing element (PE) is uniform at 1mm x 1mm
and that the PEs are placed as a square matrix regardless of the topology, as shown in Fig.
2.1. A PE could be a single processor or a sub-system such as a multimedia accelerator,
a memory system or an external interface. Each PE can behave as a master (initiator)
or a slave (target) depending on its operation. The number of PEs, N, scales from 16 to
100. The hierarchical topology is assumed to be divided into $\sqrt{N}$ clusters, and each
cluster contains $\sqrt{N}$ PEs.
The bus and point-to-point topologies do not have internal data buffers in their
interconnection networks but could have them in their interfaces. Meanwhile, the star
and mesh topologies have internal packet buffers at every switching hop. The
buffer capacity in each switch is determined by considering the flow control mechanism
and the congestion level of the switch. The transaction unit is a packet which is composed
of a 16-bit header and a 64-bit payload (32-bit address and 32-bit data). The packet is
serialized onto a unidirectional 10-bit link which consists of 8-bit packet signals, a 1-bit
STROBE signal as a timing reference and a 1-bit End-Of-Packet signal [11]. I
assume that the clocks of the integrated PEs are plesiochronous, i.e. each PE runs at its
own frequency and the clocks are not synchronized with each other.
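The packet format above implies that one 80-bit packet crosses the 8-bit data lanes in ten link words. The following is a minimal sketch of that serialization, not the BONE-2 implementation: the field order (header, address, data), the most-significant-byte-first ordering and the toggling STROBE are assumptions made only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class LinkWord(NamedTuple):
    data: int    # 8-bit slice of the packet
    strobe: int  # 1-bit timing reference (toggling pattern is an assumption)
    eop: int     # 1-bit End-Of-Packet flag


def serialize(header: int, address: int, data: int) -> Iterator[LinkWord]:
    """Split an 80-bit packet (16b header + 32b address + 32b data) into ten link words."""
    assert 0 <= header < 1 << 16 and 0 <= address < 1 << 32 and 0 <= data < 1 << 32
    packet = (header << 64) | (address << 32) | data   # assumed field order
    for i in range(10):
        shift = 8 * (9 - i)                            # most significant byte first (assumption)
        yield LinkWord(data=(packet >> shift) & 0xFF, strobe=i % 2, eop=int(i == 9))


def deserialize(words: Iterable[LinkWord]) -> Tuple[int, int, int]:
    """Rebuild (header, address, data) from ten link words."""
    packet = 0
    for word in words:
        packet = (packet << 8) | word.data
    return packet >> 64, (packet >> 32) & 0xFFFFFFFF, packet & 0xFFFFFFFF


words = list(serialize(0xBEEF, 0x10000000, 0xDEADBEEF))
print(len(words), deserialize(words))
```

Only the last word carries the End-Of-Packet flag, so a receiver can delimit packets without counting cycles.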
2.1.3 Traffic Model
There are two kinds of traffic patterns: one is uniform random traffic and the other
is localized traffic with a locality factor, α, whose value is between 0 and 1. The
locality factor is the ratio of the intra-cluster traffic to the overall traffic, as
illustrated in Fig. 2.2. As α gets close to 1, the traffic becomes highly localized, i.e.
most transactions occur within a cluster. If α is 0.5, half of the
traffic is intra-cluster and the other half is inter-cluster. PEs that
require low-latency, high-bandwidth communication with each other perform better
when they are placed in the same cluster according to their
communication locality, so that the intra-cluster traffic becomes dominant. In such a
heterogeneous system, the locality factor represents the localized traffic pattern
quantitatively.
Fig. 2.2: Locality Factor.
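The locality factor can be illustrated with a small traffic generator. This is a sketch under stated assumptions, not the traffic model used for the results in this chapter: PEs are numbered so that consecutive IDs share a cluster, N is a perfect square, and destinations are drawn uniformly within the chosen domain. The function name is hypothetical.

```python
import math
import random


def pick_destination(src: int, n_pe: int, alpha: float, rng: random.Random) -> int:
    """Pick a destination PE for `src`: intra-cluster with probability alpha."""
    cluster_size = math.isqrt(n_pe)
    assert cluster_size * cluster_size == n_pe, "N is assumed to be a perfect square"
    cluster = src // cluster_size
    first = cluster * cluster_size
    local = [p for p in range(first, first + cluster_size) if p != src]
    remote = [p for p in range(n_pe) if p // cluster_size != cluster]
    pool = local if rng.random() < alpha else remote
    return rng.choice(pool)


rng = random.Random(0)
n_pe, alpha, trials = 36, 0.7, 20000
cluster_size = math.isqrt(n_pe)
hits = sum(pick_destination(0, n_pe, alpha, rng) // cluster_size == 0 for _ in range(trials))
measured_alpha = hits / trials
print(f"measured intra-cluster ratio: {measured_alpha:.3f}")
```

The measured intra-cluster ratio converges to α as the number of generated transactions grows.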
2.1.4 Energy, Area and Latency Models for Networks-on-Chip
I use the average packet traversal energy $E_{pkt}$ as a network energy efficiency metric,
which can be estimated by the following equation, summing up the energies on
switching hops, links and a final destination buffer [29].
$$E_{pkt} = H_{Avg} \cdot (E_{Queue} + E_{ARB} \cdot SS_{Avg} + E_{SF} \cdot SS_{Avg}) + L_{Avg} \cdot E_{Link} + E_{Queue} \qquad (1)$$
where $H_{Avg}$ and $L_{Avg}$ are the average hop count and the average distance, respectively,
between a sender PE and a receiver PE. $SS_{Avg}$ is the average switch size, i.e. the number of
I/O ports in a switch. The energy consumption on a switching hop is composed of the energy
consumed in an input queuing buffer or latch, $E_{Queue}$, the switching fabric, $E_{SF}$, and the
arbitration logic, $E_{ARB}$. $E_{Link}$ stands for the transmission energy on a unit-length link. These
energy terms are measured from the circuit implementation in 0.18µm technology as
shown in Table 2.1.
The area cost of a network can be derived by summing up the area of switches and
links.
$$A_{Tot} = H_{Tot} \cdot (A_{Queue} \cdot SS_{Avg} + A_{ARB} \cdot SS_{Avg} + A_{SF} \cdot SS_{Avg}^{2}) + L_{Tot} \cdot A_{Link} \qquad (2)$$
where $H_{Tot}$ and $L_{Tot}$ are the total hop count and the total link length on the network,
respectively. The physical area of a queuing buffer, an arbiter and a switch fabric is
measured from the real circuit layout in 0.18µm technology (see Table 2.1).
The latency through the network can be derived by accumulating the hop delay and
the link delay.
$$T_{Latency} = H_{Avg} \cdot (T_{Sync} + T_{Queue} + T_{ARB} + T_{SF}) + L_{Avg} \cdot T_{Link} \qquad (3)$$
where the hop delay is the sum of the signal synchronization delay ($T_{Sync}$), which
occurs when a packet traverses from switch to switch, the queuing delay ($T_{Queue}$), the
arbitration delay ($T_{ARB}$) and the switching fabric delay ($T_{SF}$). The design- and
technology-dependent parameters are measured from post-layout simulation as
shown in Table 2.1.
TABLE 2.1: PHYSICAL PARAMETERS IN 0.18µm CMOS TECHNOLOGY [11].
Category | Description | Typical value | Sym.
Energy (J) / 1-packet | Buffer (write/read) | 1.97 x 10^-10 | E_Queue
Energy (J) / 1-packet | Switching fabric/port | 6.25 x 10^-12 | E_SF
Energy (J) / 1-packet | 2:1 multiplexer | 3.04 x 10^-12 | E_MUX
Energy (J) / 1-packet | Arbitration/port | 1.79 x 10^-13 | E_ARB
Energy (J) / 1-packet | 1-mm link | 4.38 x 10^-11 | E_Link
Energy (J) / 1-packet | 1-mm link (P-to-P) (note 1) | 8.76 x 10^-11 | E_Link_PtP
Dimension (mm2) | Processing Element (PE) | 1 x 1 | (L_PE)^2
Dimension (mm2) | Cluster Unit | N^(1/4) x N^(1/4) | (L_CU)^2
Area (µm2) | 3-packet queuing buffer | 8.40 x 10^4 | A_Queue
Area (µm2) | Crossbar fabric | 1.47 x 10^3 x (# of s/w ports)^2 | A_SF
Area (µm2) | M:1 10b-multiplexer | 9.52 x 10^2 x (M-1) | A_MUX
Area (µm2) | Arbitration logic | 2.70 x 10^3 x (# of s/w ports) | A_ARB
Area (µm2) | 20b 1-mm link | 3.80 x 10^4 | A_Link
Latency (ns) | Arbiter | 1/3 x log2(# of s/w ports) | T_ARB
Latency (ns) | Switching fabric | 6.5 x 10^-3 x (# of s/w ports)^2 | T_SF
Latency (ns) | 1-mm link (repeated link) | 0.42 | T_Link
Note 1: The point-to-point topology consumes much more metal routing resources than the other topologies do. Therefore the upper metal layers must be fully used. This increases the vertical coupling capacitance of the wires, and thus the link energy consumption also increases.
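As an illustration of how the energy model is used, the sketch below evaluates equation (1), read as one queue, arbiter and fabric traversal per hop plus the link and destination-buffer energy, with the energy values of Table 2.1. The function name and the example operating point (one hop, a 4-port switch, 2 mm of wire) are hypothetical and chosen only to show the arithmetic.

```python
# Energy per packet in joules, 0.18 um CMOS, copied from Table 2.1.
E_QUEUE = 1.97e-10   # buffer write/read
E_SF = 6.25e-12      # switching fabric, per port
E_ARB = 1.79e-13     # arbitration, per port
E_LINK = 4.38e-11    # 1-mm link


def packet_energy(h_avg: float, ss_avg: float, l_avg_mm: float) -> float:
    """Average packet traversal energy following equation (1)."""
    per_hop = E_QUEUE + E_ARB * ss_avg + E_SF * ss_avg
    return h_avg * per_hop + l_avg_mm * E_LINK + E_QUEUE


# Hypothetical operating point: 1 hop through a 4-port switch over 2 mm of wire.
example = packet_energy(h_avg=1, ss_avg=4, l_avg_mm=2.0)
print(f"E_pkt = {example:.3e} J")
```

With these parameters the buffer energy dominates a single hop, which is why the hop count of a topology weighs so heavily in the comparisons that follow.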
2.2. Energy Exploration
2.2.1 Bus Topology
A conventional Mux-based bus structure [33] has two unidirectional buses: one
from masters to slaves and the other from slaves to masters, as shown in Fig. 2.3(a). In
this master/slave bus topology, direct interconnections between two masters are
impossible, so the masters share a memory to communicate with each other; direct
message passing between two masters is not supported. To
cope with this limitation of connectivity, a fully connected bus can be used as
illustrated in Fig. 2.3(b). It has a single shared bus which connects all of the PEs
regardless of whether they are masters or slaves. Fig. 2.3(c) shows a hierarchical bus structure
in which the local buses are master/slave buses and the global bus is a fully connected bus.
By using the fully connected bus in the global network instead of the master/slave bus,
direct access between two clusters is possible without a shared memory.
Fig. 2.3: Bus topologies.
The following equations show the Epkt of each bus in Fig. 2.3.
(a) Master/slave bus (number of masters: N/2, number of slaves: N/2)
$$E^{MSB}_{pkt} = \left[ E_{ARB} \cdot \frac{N}{2} + E_{MUX} \cdot \left( \frac{N}{2} - 1 \right) + E_{Link} \cdot \left\{ \frac{1}{2} \cdot \frac{N}{2} + \frac{N}{2} \right\} \cdot L_{PE} \right] + E_{Queue} \qquad (1)$$
The multiplexer and arbiter on the master/slave bus have N/2 inputs. The average distance
from a master to the multiplexer can be derived as $\frac{1}{2} \cdot \frac{N}{2} \cdot L_{PE}$ and the length of
the shared bus is $\frac{N}{2} \cdot L_{PE}$.
(b) Fully connected bus
$$E^{FB}_{pkt} = E_{ARB} \cdot N + E_{MUX} \cdot (N - 1) + E_{Link} \cdot \left[ \frac{3(N - 1)}{2} \cdot L_{PE} \right] + E_{Queue} \qquad (2)$$
The multiplexer and arbiter on the bus have N inputs. The average distance from a
master to the last multiplexer is $\frac{N - 1}{2} \cdot L_{PE}$ and the length of the shared bus
is $(N - 1) \cdot L_{PE}$.
(c) Hierarchical bus
$$E^{HB}_{pkt} = E^{MSB}_{pkt\_Local} \cdot \alpha + (2 \cdot E^{MSB}_{pkt\_Local} + E^{FB}_{pkt\_Global}) \times (1 - \alpha) + E_{Queue} \qquad (3)$$
The $E^{HB}_{pkt}$ can be derived by summing the local traversal energy and the global traversal
energy, weighted according to the traffic locality factor. The local and global traversal energy can
be obtained from equations (1) and (2), respectively, by replacing N with $\sqrt{N}$.
[Figure: Epkt (x 10-9 J, 0 to 3.0) versus the number of PEs N (20 to 100) for the master/slave bus, the fully-connected bus and the hierarchical bus, under uniform traffic and localized traffic with α = 0.3, 0.5, 0.7 and 0.9]
Fig. 2.4: Energy comparison of three bus topologies.
Fig. 2.4 shows the Epkt of the three buses according to the number of PEs under various
traffic patterns. Under uniform traffic, the flat master/slave bus outperforms the
hierarchical bus. However, as the traffic gets localized, i.e. becomes more realistic, the energy
consumption of the hierarchical bus is significantly reduced. As expected, the
hierarchical bus is more energy-efficient than the flat bus topologies
under localized traffic.
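The three bus energy expressions can be evaluated numerically as a sanity check. The sketch below is an assumption-laden transcription rather than the author's tool: it reads the wire lengths stated above as (1/2)(N/2) + N/2 PE pitches for the master/slave bus and 3(N-1)/2 PE pitches for the fully connected bus, takes the energy values from Table 2.1, and applies the hierarchical rule with N replaced by its square root. Function names are hypothetical.

```python
import math

# Energy per packet in joules, from Table 2.1; L_PE is the 1-mm PE pitch.
E_QUEUE, E_MUX, E_ARB, E_LINK = 1.97e-10, 3.04e-12, 1.79e-13, 4.38e-11
L_PE = 1.0


def master_slave_bus(n: float) -> float:
    """Equation (1): N/2-input arbiter and multiplexer plus the shared wire."""
    wire_mm = (0.5 * n / 2 + n / 2) * L_PE
    return E_ARB * n / 2 + E_MUX * (n / 2 - 1) + E_LINK * wire_mm + E_QUEUE


def fully_connected_bus(n: float) -> float:
    """Equation (2): N-input arbiter and multiplexer plus the shared wire."""
    wire_mm = 1.5 * (n - 1) * L_PE
    return E_ARB * n + E_MUX * (n - 1) + E_LINK * wire_mm + E_QUEUE


def hierarchical_bus(n: float, alpha: float) -> float:
    """Equation (3): local master/slave buses under a global fully connected bus."""
    local = master_slave_bus(math.sqrt(n))
    global_ = fully_connected_bus(math.sqrt(n))
    return alpha * local + (1 - alpha) * (2 * local + global_) + E_QUEUE


for alpha in (0.3, 0.5, 0.7, 0.9):
    print(alpha, f"{hierarchical_bus(36, alpha):.3e}")
```

Because an inter-cluster transaction crosses two local buses and the global bus, the hierarchical energy falls linearly as α rises, which is the trend described for Fig. 2.4.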
2.2.2 Mesh Topology
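The following equation shows the Epkt of a 2-D flat mesh.
$$E^{FM}_{pkt} = \left( \frac{2}{3} \sqrt[4]{N} \cdot \alpha + \frac{2}{3} \sqrt{N} \cdot (1 - \alpha) \right) \times \left[ E_{Queue} + (E_{ARB} + E_{SF}) \cdot SS_{Avg} + E_{Link} \cdot L_{PE} \right] + E_{Queue} \qquad (4)$$
In the equation, $\frac{2}{3} \sqrt[4]{N}$ and $\frac{2}{3} \sqrt{N}$ are the average hop counts of local and
global transactions, respectively.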
Fig. 2.5: Energy consumption of a mesh.
Fig. 2.5(a) shows the EFMpkt compared with the bus topologies. Under uniform traffic, the
mesh topology shows better energy efficiency than the flat bus when N is larger
than 36. As the traffic gets localized, the energy consumption of the mesh decreases
but it is still higher than that of the hierarchical-bus topology. The mesh topology
shows a flatter slope than the bus topologies as the size of the network increases. The mesh
topology is known to be more scalable than the bus topology; this work quantifies the
trend in terms of energy consumption.
It is also interesting to compare the energy consumption of hops (switches) and
links as shown in Fig. 2.5(b). The energy consumed on hops is about 5 ~ 8 times higher
than the energy consumed on links under this implementation condition.
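The hop-versus-link split can be reproduced approximately from the mesh energy expression. The sketch below rests on assumptions that are not stated in the text: an average switch size of 5 ports (four neighbours plus the local PE), a 1-mm link per hop, hop counts of (2/3)·N^(1/4) for local and (2/3)·sqrt(N) for global transactions, and the destination buffer counted as hop energy. The function name is hypothetical.

```python
import math
from typing import Tuple

# Energy per packet in joules, from Table 2.1.
E_QUEUE, E_SF, E_ARB, E_LINK = 1.97e-10, 6.25e-12, 1.79e-13, 4.38e-11
L_PE = 1.0   # mm, link length per hop
SS_AVG = 5   # assumed average switch size of a 2-D mesh router


def mesh_energy_split(n: int, alpha: float) -> Tuple[float, float]:
    """Return (hop energy, link energy) of a flat mesh with N PEs."""
    hops = (2 / 3) * n ** 0.25 * alpha + (2 / 3) * math.sqrt(n) * (1 - alpha)
    hop_energy = hops * (E_QUEUE + (E_ARB + E_SF) * SS_AVG) + E_QUEUE
    link_energy = hops * E_LINK * L_PE
    return hop_energy, link_energy


ratios = []
for n in (16, 36, 64, 100):
    hop_e, link_e = mesh_energy_split(n, alpha=0.0)
    ratios.append(hop_e / link_e)
    print(n, f"hop/link = {hop_e / link_e:.1f}")
```

Under these assumptions and purely global traffic, the hop-to-link energy ratio stays between 5 and 8 for N from 16 to 100.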
2.2.3 Star Topology
In a star topology, the hop count is always 1 and every transaction goes through the
central crossbar switch. The following equation represents the Epkt of a flat star
topology.
$$E^{FS}_{pkt} = 1 \times (E_{Queue} + E_{ARB} \cdot N + E_{SF} \cdot N) + E_{Link} \times (\sqrt{N} - 2) \times L_{PE} + E_{Queue} \qquad (5)$$
The central switch has N I/O ports and the average distance between two
PEs via the central switch is $\sqrt{N} - 2$. The energy of a hierarchical star is given by the
following equation.
$$E^{HS}_{pkt} = E^{S}_{pkt\_Local} \cdot \alpha + (E^{S}_{pkt\_Local} + E^{S}_{pkt\_Global}) \cdot (1 - \alpha) \qquad (6)$$
The local/global energy, $E^{S}_{pkt\_Local}$ and $E^{S}_{pkt\_Global}$, can be obtained from (5) by
replacing N with $\sqrt{N}$. In the case of the global network, $L_{PE}$ should also be replaced by
$L_{CU}$.
Fig. 2.6: Energy comparison of star topologies.
Fig. 2.6(a) shows the comparison of EFSpkt and EHSpkt. When the traffic is less
localized and N is less than a few tens, the flat-star topology shows higher energy
efficiency because of its lower hop count than the hierarchical-star topology. However,
as the traffic gets localized (α > 0.7), the hierarchical star outperforms the flat-star
topology. Moreover, the hierarchical star shows a very flat energy profile with
increasing network size, so it is highly scalable in terms of energy cost. Fig.
2.6(b) shows the energy comparison between switches and links.
2.2.4 Point-to-Point Topology
The point-to-point topology has a dedicated link between each pair of PEs as shown in
Fig. 2.1(d). It provides the shortest link length without any intermediate switches
between a sender and a receiver, and thus it shows the lowest energy consumption of
all the topologies. However, it suffers from a huge link area and a large number of
input/output ports. (The area cost will be discussed in Section 2.3.) The following
equation describes the Epkt of the flat point-to-point topology.
$$E^{FP}_{pkt} = E_{ARB} \cdot (N - 1) + E_{MUX} \cdot (N - 2) \times 2 + E_{Link\_PtP} \cdot \left\{ \left( \frac{2}{3} \right) \cdot \sqrt[4]{N} \cdot \alpha + \left( \frac{2}{3} \right) \cdot \sqrt{N} \cdot (1 - \alpha) \right\} \cdot L_{PE} + E_{Queue} \qquad (7)$$
Each PE has a 1:(N-1) de-multiplexer and an (N-1):1 multiplexer at its output and input port,
respectively. The average link length of a point-to-point topology is the same as that of
a mesh topology, which is the shortest Manhattan distance.
2.2.5 Heterogeneous Topologies
In the previous sections, I have analyzed basic topologies such as the bus, mesh, star
and point-to-point topologies and some hierarchical and homogeneous topologies such
as the hierarchical bus and hierarchical star. It was found that the hierarchical
topologies offer better energy efficiency and scalability than the flat topologies.
Therefore it is worthwhile to examine the energy efficiency of other hierarchical
and heterogeneous topologies, for example, a local-star global-mesh or a local-bus
global-star topology. In order to consider such hierarchical and heterogeneous
topologies, I evaluate the basic topologies in two hierarchical domains: a local
(intra-cluster) and a global (inter-cluster) domain.
Fig. 2.7: Energy cost in a (a) local and (b) global network.
Fig. 2.7 shows the comparison of the energy efficiency in a local and a global
network separately. In a local network, the energy cost on a link is much lower than
that in a switch because of the shorter distance between communicating nodes. Since a
mesh topology has a larger number of hops than the others, it shows the highest energy
cost. On the other hand, in a global network, the energy cost on a link
becomes significant. As a result, the energy cost of the bus topology, which has the
longest wires, is the worst. Considering both the local and global networks, the
point-to-point topology is the most energy efficient and the star topology is the next.
All of the topologies are compared under various traffic conditions as shown in Fig.
2.8. In any traffic condition, the point-to-point topologies show the best energy
efficiency. If the point-to-point topologies cannot be adopted because they are infeasible,
the star topologies are the most energy efficient of the remaining candidates. If N is fixed to 36 (see
Fig. 2.8(c)), for instance, the flat star is the best for less localized traffic while the
hierarchical star (L-star G-star) is the best for more localized traffic (α > 0.5). The
mesh consumes 30~80% more energy than the hierarchical star does. The mesh
outperforms the hierarchical bus by about 10~20% for less localized traffic, but the
hierarchical bus shows much better energy efficiency than the mesh for highly localized
traffic.
Fig. 2.8: The energy cost comparison under (a) uniform, (b) localized and (c) varying traffic condition.
2.3 Area Exploration
I also analyze the area cost of on-chip networks, which is one of the most important
practical issues but has not been considered in prior works.
Fig. 2.9(a) and (b) show the area cost of the basic topologies in a local and a global
network domain. As expected, the area cost of the point-to-point topology
grows so steeply that it is not feasible to implement on a chip. The bus topology
shows the lowest area cost. Interestingly, the star topology occupies almost the same
area as the bus in a local network and consumes slightly less area than the mesh does
in a global network. Fig. 2.9(c) shows the area comparison of all of the topologies.
The hierarchical bus topology shows the lowest area cost as expected. However, the
local-bus global-star/mesh and local-star global-star/mesh topologies also occupy as
little area as the hierarchical bus does. This is because the area of the total network
depends strongly on the local networks rather than the global network.
[Figure: (a) local network area [mm2] and (b) global network area [mm2] for the bus, star, mesh and point-to-point topologies, with horizontal axes N_PE and N_CLT (4 to 10); (c) total network area [mm2] versus N (20 to 100) for L-PtP G-PtP, L-PtP G-Star, L-Mesh G-Star, Flat Star, Flat Mesh, L-Star G-Mesh, L-Star G-Star, L-Bus G-Mesh, L-Bus G-Star and L-Bus G-Bus]
Fig. 2.9: Area cost in a (a) local, (b) global network and (c) hierarchical topologies.
2.4 Performance Exploration
The communication throughput, latency and maximum achievable frequency
strongly depend on the topology of the network. In this section, I compare the
performance of the hierarchical topologies as well as the basic topologies.
2.4.1 Average Throughput of a PE
The throughput degradation of a packet-switched (i.e. pipelined) network primarily
occurs because different flows2 share an intermediate link as illustrated in Fig. 2.10.
Furthermore, if the switch uses FIFO input queues, the head-of-line (HOL) blocking
phenomenon further limits the maximum achievable throughput on a link [34]. In this
subsection, the average injection throughput of a source PE is analyzed by accounting
for these two sources of throughput degradation in each topology.
Fig. 2.10: Throughput degradation due to the link sharing and HOL-blocking.
2 A flow means a unique routing path from a source node to a destination node.
A. Point-to-Point Topology
Since the point-to-point topology has a dedicated link for each flow, there are neither
shared links nor shared queues. Thus there is no throughput degradation on the
point-to-point network.
B. Bus Topology
Since all the PEs share a single link (bus), the bus bandwidth is equally divided to
each PE. Thus the average injection-throughput of a PE is reduced to 1/(the number of
PEs).
C. Star Topology
In the star topology, there is no shared link. However, HOL blocking exists when
a PE is able to issue multiple outstanding addresses3. The throughput degradation
due to HOL blocking under uniform i.i.d. traffic has been derived in the literature [34].
The throughput decreases as the number of PEs increases and saturates at 0.58 when
the number of PEs gets larger than sixteen.
D. Mesh Topology
The throughput of the mesh topology is more complicated to derive because there
are multiple shared links and the throughput strongly depends on the routing algorithm.
As a simple approach, I first derived the maximum number of flows on an
intermediate link (NFOIL) and on a PE-link (NFOPL) as illustrated in Fig. 2.11. I applied
a dimension-order routing algorithm [36].
3 The ability to issue multiple outstanding addresses means that a PE can issue new transaction addresses without waiting for earlier transactions to complete. Because it enables parallel processing of transactions, this feature can improve system performance [13].
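[Figure: flows sharing links in an n x n mesh, annotated with $\frac{1}{6} n^{2} (n + 1)$ and $n^{2} - 1$ flows per link]
Fig. 2.11: The number of flows on a link in 3 x 3 mesh topology.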
Since $N_{FOIL}$ flows share an intermediate link and $N_{FOPL}$ flows share a PE-link,
the aggregated throughput of the flows cannot exceed the link throughput. If the
throughput of a flow is denoted as $\rho_{flow}$ and the maximum throughput of a link limited
by HOL blocking is denoted as $\rho_{link\_HOL}$, the following condition should be satisfied.
$$\max(N_{FOIL}, N_{FOPL}) \times \rho_{flow} \le \rho_{link\_HOL} \qquad (8)$$
Therefore the average injection-throughput of a PE, $\rho_{PE}$, can be derived as
follows.
$$\rho_{PE} = (n^2 - 1) \times \rho_{flow} \le \min\left(\frac{6(n-1)}{n^2},\, 1\right) \times \rho_{link\_HOL} \qquad (9)$$
where $\rho_{link\_HOL}$ is a function of the switch size, i.e. the number of input ports of a
switch [34].
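The flat-mesh bound can be evaluated numerically with the minimal sketch below. It assumes the flow counts read from Fig. 2.11 for an n x n mesh, namely n²(n+1)/6 flows on the busiest intermediate link and n² - 1 flows on a PE-link. The HOL-limited link throughput is passed in by the caller because its dependence on switch size comes from [34] and is not modelled here; the function name is hypothetical.

```python
def mesh_pe_throughput_bound(n, rho_link_hol):
    """Upper bound on the PE injection-throughput of an n x n mesh."""
    n_foil = n * n * (n + 1) / 6   # flows sharing the busiest intermediate link
    n_fopl = n * n - 1             # flows sharing a PE-link
    flows_per_pe = n * n - 1       # one flow to every other PE
    # Each flow is limited by the most heavily shared link on its path.
    rho_flow_max = rho_link_hol / max(n_foil, n_fopl)
    return flows_per_pe * rho_flow_max


if __name__ == "__main__":
    # rho_link_hol = 1.0 isolates the link-sharing effect (illustrative value).
    for side in (2, 3, 5, 7, 10):
        print(side * side, round(mesh_pe_throughput_bound(side, 1.0), 3))
```

For small meshes the PE-link is the bottleneck and the bound stays at the link throughput; for larger meshes the intermediate link dominates and the bound falls as 6(n-1)/n².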
Fig. 2.12 compares the PE-injection-throughput versus the number of PEs
in the basic topologies under uniform i.i.d. traffic. The star and mesh
topologies show comparable throughput, while the point-to-point and bus
topologies perform the best and the worst, respectively. When the number of PEs is
less than 20, the mesh shows better throughput than the star because the smaller
switches in the mesh have a lower HOL-blocking probability. However, as the network
size gets larger, the intensive sharing of intermediate links degrades the mesh
throughput.
In on-chip communications, the average injection-throughput of a PE is generally not
higher than 0.5 because the PE needs internal processing time between
load/store instructions⁴. By this estimate, the star guarantees good bandwidth at any
network size. Although the mesh topology shows throughput degradation as the
network size grows beyond a few tens of PEs, the throughput still seems acceptable for
general systems.
4 The injection-throughput of 0.5 means that the PE issues a load/store instruction every second clock cycle.
[Figure: injection-throughput of a PE (0 to 1) versus the number of PEs (0 to 100) for the PtP, star, mesh and bus topologies. The mesh curve is annotated "Limited by destination link" and "Limited by intermediate link".]
Fig. 2.12: Injection-throughput of a PE in basic topologies.
E. Hierarchical Topologies
The hierarchical topology consists of a local network and a global network, as shown in
Fig. 2.13.
[Figure labels: $\rho_{PE}$ injected by each PE; $\le \rho_{link\_HOL}$; $(1-\alpha) \cdot \rho_{PE} \times n$; $\le 1$.]
Fig. 2.13: Throughput in a hierarchical topology.
E-1. Bus as a local network
The local bus network should deal with the aggregated throughput from each PE and
the gateway. Therefore the following equation should be satisfied.
$$n \times \rho_{PE} + (1-\alpha) \cdot \rho_{PE} \times n \le \rho_{BUS} = 1$$
$$\therefore\ \rho_{PE} \le \frac{1}{n(2-\alpha)} \qquad (10)$$
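As an illustrative check of the local-bus bound $\rho_{PE} \le 1/(n(2-\alpha))$, take n = 8 PEs per cluster and a locality factor α = 0.8. Both values are assumptions chosen for the arithmetic and are not taken from the evaluation.

```latex
\rho_{PE} \le \frac{1}{n(2-\alpha)} = \frac{1}{8 \times (2 - 0.8)} = \frac{1}{9.6} \approx 0.104
```

Under these assumed values each PE could inject a packet roughly once every ten cycles before the shared local bus saturates.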
E-2. Star as a local network
The throughput on a gateway-link and a PE-link in a local star network is limited by
the HOL blocking.
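$$\text{a)}\quad (1-\alpha) \cdot \rho_{PE} \times n \le \rho_{link\_HOL}$$
$$\text{b)}\quad \rho_{PE} \le \rho_{link\_HOL}$$
$$\therefore\ \rho_{PE} \le \min\left(\frac{1}{(1-\alpha)\, n},\, 1\right) \times \rho_{link\_HOL} \qquad (11)$$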
E-3. Mesh as a local network
The inter-cluster flows (NFOIL_GW) and intra-cluster flows (NFOIL_LOCAL) share an
intermediate link as illustrated in Fig. 2.13. The maximum NFOIL_GW and NFOIL_LOCAL are
derived as:
$$N_{FOIL\_GW} = \left\lceil \frac{n}{2} \right\rceil \cdot n, \qquad N_{FOIL\_LOCAL} = \frac{1}{6}\, n\, (n+1) \qquad (12)$$
Therefore the throughput on the intermediate link is calculated as the sum of
intra-cluster flows and inter-cluster flows as the following equation. The link throughput is
limited by the HOL blocking.
$$(N_{FOIL\_LOCAL} - N_{FOIL\_GW}) \times \alpha \times \rho_{flow} + N_{FOIL\_GW} \times (1-\alpha) \times \rho_{PE} \le \rho_{link\_HOL} \qquad (13)$$
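$$\therefore\ \rho_{PE} = (n-1) \times \rho_{flow} \le \frac{(n-1)}{\alpha \cdot N_{FOIL\_LOCAL} + \left((n-1)(1-\alpha) - \alpha\right) \times N_{FOIL\_GW}} \times \rho_{link\_HOL}$$
$$\text{and}\quad \rho_{PE} \le \rho_{link\_HOL} \qquad (14)$$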
E-4. Bus as a global network
The aggregated throughput of inter-cluster flows on a global bus is less than or equal
to 1.
$$(1-\alpha) \cdot \rho_{PE} \times n^2 \le 1 \quad \therefore\ \rho_{PE} \le \frac{1}{(1-\alpha) \cdot n^2} \qquad (15)$$
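[Figure: n clusters, each a local network (LN) with a gateway, attached to a global bus. Each gateway carries $(1-\alpha) \cdot \rho_{PE} \times n$ and the global bus carries $(1-\alpha) \cdot \rho_{PE} \times n^2$.]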
Fig. 2.14: Throughput in a global bus topology.
E-5. Star as a global network
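The analysis is the same as for the local star network.
$$(1-\alpha) \cdot \rho_{PE} \times n \le \rho_{link\_HOL(n)}$$
$$\therefore\ \rho_{PE} \le \min\left(\frac{1}{(1-\alpha)\, n},\, 1\right) \times \rho_{link\_HOL} \qquad (16)$$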
In the global star topology, there is only a single link between the local and global
switches, which limits the overall throughput. To cope with this, the inter-cluster
link can be doubled, like a fat-tree, as shown in Fig. 2.15. Then the throughput is
given by the following equation.
$$(1-\alpha) \cdot \rho_{PE} \times n \le 2\,\rho'_{link\_HOL(n)}$$
$$\therefore\ \rho_{PE} \le \min\left(\frac{2}{(1-\alpha)\, n},\, 1\right) \times \rho'_{link\_HOL} \qquad (17)$$
where the link throughput $\rho'_{link\_HOL}$ is lower than the $\rho_{link\_HOL}$ of equation
(16) because the size of the global switch doubles.
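The effect of doubling the gateway link can be sketched as follows. The function evaluates the bound min(k / ((1-α)·n), 1) × ρ_link_HOL for k parallel links between a local switch and the global switch. The HOL-limited link throughputs used in the example (0.60 for the single-link switch and 0.59 for the larger double-link switch) are placeholder values, not results from this work.

```python
def global_star_bound(n, alpha, rho_link_hol, links=1):
    """PE injection-throughput bound for a global star with `links` gateway links."""
    inter_cluster_share = (1.0 - alpha) * n  # gateway load per unit of rho_PE
    if inter_cluster_share == 0.0:
        # Fully local traffic never touches the gateway link.
        return rho_link_hol
    return min(links / inter_cluster_share, 1.0) * rho_link_hol


if __name__ == "__main__":
    n, alpha = 10, 0.8                                  # assumed cluster size and locality
    single = global_star_bound(n, alpha, 0.60, links=1)
    double = global_star_bound(n, alpha, 0.59, links=2)
    print(round(single, 3), round(double, 3))
```

With these assumed numbers the single gateway link halves the achievable throughput, while the doubled link removes that limit at the cost of a slightly lower HOL-limited throughput in the larger global switch.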
Fig. 2.15: hierarchical star topology with double link between local and global networks.
E-6. Mesh as a global network
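The same analysis as in (8) applies, replacing only $\rho_{flow}$ with $\rho_{flow}(1-\alpha)n$.
$$\max(N_{FOIL\_GLOBAL}, N_{FOPL\_GLOBAL}) \times \rho_{flow} \cdot (1-\alpha) \cdot n \le \rho_{link\_HOL} \qquad (18)$$
$$\text{where}\quad N_{FOIL\_GLOBAL} = \frac{1}{6}\, n\, (n+1), \qquad N_{FOPL\_GLOBAL} = n - 1 \qquad (19)$$
$$\text{Thus,}\quad \rho_{PE} = n \times \rho_{flow} \le \frac{n \times \rho_{link\_HOL}}{(1-\alpha) \cdot n \times \max(N_{FOIL\_GLOBAL}, N_{FOPL\_GLOBAL})} \qquad (20)$$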
Fig. 2.16(a) compares the PE-injection-throughput in a local network where
the number of PEs is limited to 10. The local mesh topology provides the best
throughput. Meanwhile, the throughput of the star topology decreases as the network
size increases because of the limited bandwidth on the gateway-link between the local
and global networks. By doubling the gateway-link, the throughput of the local star
becomes similar to that of the local mesh. As the traffic gets localized, the local network
throughput increases, as shown in Fig. 2.16(b). Under highly localized traffic, the mesh
and star provide similar throughput.
Fig. 2.16: PE-injection-throughput in a local network.
Fig. 2.17(a) compares the PE-injection-throughput in a global network where
the number of clusters (n) is limited to 10 and the number of PEs in a cluster is n.
When the number of clusters is less than or equal to 5, the star and mesh topologies
do not suffer from throughput degradation. However, as the network size gets
bigger, the throughput of the mesh and star topologies is degraded by excessive link
sharing. By doubling the global link, as in star2, the degradation disappears.
Fig. 2.17: Throughput in a global network.
[Figure: injection-throughput of a PE (0 to 0.8) for the topologies lbgb, lbgs, lbgm, lsgs, lsgm, lmgs, lmgm, lbgs2, lmgs2, lsgs2 and the flat mesh. (a) Versus the number of PEs (n², 10 to 100) at α = 0.8. (b) Versus the locality factor (α, 0.2 to 0.9) at n² = 36.]
Fig. 2.18: Throughput in hierarchical topologies.⁵
Fig. 2.18 shows the overall throughput of the hierarchical and heterogeneous
topologies. First, a bus topology in the local or global network limits the overall
throughput severely, as expected. Interestingly, the flat mesh topology performs the
best in terms of network throughput. Although lmgs2 and lsgs2 show lower
throughput under less-localized traffic, they are still acceptable for general systems.
By using the proposed analytical method, we can estimate the throughput of
hierarchical topologies and understand the trends of each topology.
5 Notation of the items: l (local), g (global), m (mesh), s (star), s2 (star with double global links), b (bus)
2.4.2 Average Packet Traversal Latency
The latency through the network can be derived by accumulating the hop delay and
link delay.
$$T_{Latency} = H_{Avg} \cdot T_{Hop} + L_{Avg} \cdot T_{Link}$$
$$\text{where}\quad T_{Hop} = T_{Sync} + T_{Queue} + T_{ARB} + T_{SF} \qquad (21)$$
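A minimal sketch of how this latency accumulation might be evaluated is given below. It interprets the hop delay as the sum of synchronization, queuing, arbitration and switching-fabric delays, and treats the link term as an average number of link segments multiplied by a per-segment delay; that interpretation, the class and function names, and all numeric values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class HopDelay:
    sync: float            # synchronization delay
    queue: float           # queue-waiting delay
    arbitration: float     # arbitration delay
    switch_fabric: float   # switching-fabric delay

    def total(self) -> float:
        return self.sync + self.queue + self.arbitration + self.switch_fabric


def packet_latency(avg_hops, hop, avg_links, link_delay):
    """Average traversal latency: hop delays plus link delays."""
    return avg_hops * hop.total() + avg_links * link_delay


if __name__ == "__main__":
    hop = HopDelay(sync=0.25, queue=0.5, arbitration=0.75, switch_fabric=0.5)  # ns, made up
    print(packet_latency(avg_hops=2, hop=hop, avg_links=3, link_delay=0.5))
```

Because every hop contributes the full switch delay, a topology with fewer hops benefits directly, even if each of its switches is somewhat slower.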
[Figure axes: latency in a hop [nsec] and max. frequency of a switch [GHz].]
Fig. 2.19: Latency in a hop-switch and the maximum achievable clock frequency.
Fig. 2.19 shows the latency in a hop-switch and its maximum achievable clock frequency
versus the number of I/O ports in a switch, up to 10. The queue-waiting-latency (Tqueue)
is caused by output-port conflicts and HOL blocking in a FIFO queue. The clock
frequency of a switch is determined by the speed of the arbitration logic. It saturates around
450MHz even when the switch has 100 I/O ports, provided the arbiter is implemented with
Mux-Tree circuits [11]. With these switch-latency characteristics, I evaluate the latency
of various hierarchical and heterogeneous topologies, as shown in Fig. 2.20.
The average packet traversal latency shows trends similar to the average packet traversal
energy in Fig. 2.8, since latency, like energy consumption, accumulates over each hop and
link. The evaluation results show that the hierarchical star has
the lowest latency and lsgm the next lowest. The flat mesh topology experiences almost
twice the latency of the hierarchical topologies. The packet latency in a flat star
topology rises sharply due to its huge switching-fabric latency.
Fig. 2.20: Average packet traversal latency.
2.5 Discussion
2.5.1 Performance (Throughput and Latency)
The best throughput is provided by the flat mesh topology. But the flat mesh
topology suffers from long latency (twice that of the H-star) and a larger area cost (three
times that of the H-star). Furthermore, when the traffic gets localized, the hierarchical and
heterogeneous topologies achieve throughput competitive with the flat mesh
topology. In many embedded systems, the required average injection-throughput of a
PE is not so high. For example, if a PE is a general purpose microprocessor with a
cache memory, the miss rate is less than 0.01 in conventional benchmark
applications [37]. Thus the PE-injection-throughput (or the cache-block replacing
traffic) will be less than 0.1. With this throughput requirement, most of the topologies
except the bus topologies meet the throughput specification.
On the other hand, network latency could be a more critical issue in real-time
applications. Moreover, latency reduction directly improves the system performance.
In terms of network latency, the hierarchical star topology performs
best because it has the fewest hops.
2.5.2 Cost (Energy and Area)
I have analyzed the energy and area cost of the flat and hierarchical topologies with
various network sizes and traffic patterns. The energy cost of the point-to-point
topologies is the lowest, but their area cost is too high to be implemented. Although the
hierarchical bus shows the lowest area cost, its energy cost increases more rapidly
with increasing network size than that of the other hierarchical topologies. According to our
analysis, the hierarchical or multilayer bus cannot be used when the number of
integrated cores is larger than 25. The flat mesh topology shows better energy
efficiency than the hierarchical bus only when the traffic is uniformly distributed.
However, as the traffic gets localized, the energy cost of the mesh does not scale down
as much as that of the hierarchical topologies. Moreover, the area cost of the mesh is
usually three times larger than that of the hierarchical topologies. If the
point-to-point topologies are not taken into account, a flat star topology shows the best energy
efficiency when the traffic is uniform and the network size is less than 80. However, it
shows worse energy efficiency as the traffic becomes localized and the network size gets
bigger.
The hierarchical star (local-star global-star) topology is the most cost-efficient and
scalable topology for the heterogeneous systems where the traffic is localized. The
energy cost is the lowest among hierarchical topologies and the area cost is also
comparable with the hierarchical bus.
[Figure: Energy² x Area (0 to 4.0) versus network size N (20 to 100) under (a) uniform traffic and (b) localized traffic (α = 0.80). Curves: Multilayer Bus, L-Bus G-Mesh, L-Bus G-Star, L-Mesh G-Star, L-Star G-Star, L-Star G-Mesh, L-PtP G-PtP, Flat Mesh, Flat Star and Flat PtP.]
Fig. 2.21: Energy² and area product
Fig. 2.21 shows the energy² and area product as a single cost-efficiency metric, which
is normalized to the hierarchical bus. Under both uniform and localized
traffic, the hierarchical star shows the best cost-efficiency. As a summary, Fig. 2.22
shows the cost distribution of selected topologies with traffic and network size
variations. The height and the slope of the distributed area represent the energy
variation according to the traffic variation and the network size, respectively. The
hierarchical star topology shows less energy variation than the hierarchical bus and the
mesh. It also shows less area variation than the mesh and the star topology. Finally, the
hierarchical star topology is located in the lower left corner, which means that it is
the most cost-effective of these topologies. In order to increase the throughput of
the hierarchical star topology, the global link can be doubled as shown in Fig. 2.15.
In this case, the area overhead is 20%, which is not significant compared with the
throughput enhancement.
Fig. 2.22: Energy and Area Cost distributions with traffic and the network size variations.
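The energy² x area metric and its normalization to a reference topology can be sketched in a few lines. The energy and area values below are placeholders chosen only to show the calculation; they are not the results plotted in Fig. 2.21, and the function and dictionary names are hypothetical.

```python
def normalized_cost(energy, area, ref_energy, ref_area):
    """Energy^2 x area of a topology, normalized to a reference topology."""
    return (energy ** 2 * area) / (ref_energy ** 2 * ref_area)


if __name__ == "__main__":
    # (energy, area) pairs in arbitrary units; values are invented for illustration.
    topologies = {
        "hierarchical bus": (2.0, 1.0),
        "hierarchical star": (1.0, 1.5),
        "flat mesh": (1.5, 3.0),
    }
    ref_energy, ref_area = topologies["hierarchical bus"]
    for name, (energy, area) in topologies.items():
        print(name, round(normalized_cost(energy, area, ref_energy, ref_area), 3))
```

Squaring the energy term weights energy more heavily than area, so a topology with a moderate area penalty can still score well if it saves energy.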
2.6 Model Verification with Case-Study Examples
To validate the analytical performance and cost models proposed in this
chapter, I compared the analytic results with layout-aware simulation results for two
topologies, as shown in Fig. 2.23 [2], [38]. Both examples contain thirty PEs:
ten ARM 7 cores, ten private memories, five traffic generators and five shared memories.
In the first example, five AMBA AHB buses are connected to any of the five shared
memories by a crossbar switch. This topology can be roughly approximated by the
local-bus global-star among our analytic topologies. The second example is a 5x3 mesh
topology in which, unlike a conventional mesh, two PEs are connected to a single switch.
On top of the two platforms, a multimedia benchmark is run on the ARM
processors with heavy synchronization activity, since it models a producer/consumer
pipeline of multimedia processing. The interconnection fabrics are synthesized with
0.13µm technology libraries [38].
[Figure: layout views of the two validation examples; the first includes a 5x5 crossbar switch.]
Fig. 2.23: Validation Examples and those layout views in 0.13µm technology
I compared the results from the analytical models with the simulated results, as shown in
Fig. 2.24. The energy and latency figures are normalized to the multilayer bus. Since
the simulation model uses a different packet format, flit size and flow control
scheme, the absolute numbers are different. In this work, however, the relative trends
between the topologies are emphasized. Even though two examples are
not sufficient to establish the performance and cost trends, the plots in the figure show
similar trends.
Fig. 2.24: Comparison between the proposed analytical model and the simulation results
2.7 Summary
In this chapter, a throughput/latency/energy/area-oriented topology exploration is
performed for Networks-on-Chip design. The evaluated topology candidates include
not only flat topologies such as a bus, mesh, star and point-to-point but also sixteen
hierarchical and heterogeneous topologies. The evaluation method uses technology-
independent analytical models with implementation-based physical parameters.
This work reveals the performance and cost relationship in a hierarchical way
according to the number of integrated cores and various traffic patterns. As a result,
the hierarchical bus is no longer a cost-optimal structure when the network interconnects
more than 25 nodes. The mesh topology is also not cost-efficient compared with the
hierarchical topologies. The hierarchical topologies perform better than the flat
topologies as the traffic becomes more realistic, with locality properties. Among them the
hierarchical star topology shows the best cost-efficiency and also the lowest latency.
This work provides the NoC designers with the fundamental understanding on the
performance and energy/area cost trends of not only the conventional flat topologies
but also the hierarchical and heterogeneous topologies by a straightforward and
scalable analytical method.
Chapter 3 NoC Architecture
In the design of a Network-on-Chip, many things have to be decided, such as the
communication protocol, network topology, switching style, clock-synchronization
method, signaling scheme and so on. In this chapter, I will explain what I decided and
why I made each decision at each design stage, guided by one basic idea:
low power consumption.
3.1 Circuit Switching and Packet Switching
To put it bluntly, the intra-cluster star network performs circuit switching. In circuit
switching, a physical path from the source to the destination is reserved prior to the
transmission of the data. Inside a cluster, once a transmission is granted, it is not
corrupted by other transmissions and not stored in a buffer, thus a deterministic packet
delay is guaranteed provided that the processing units are synchronized. Such predictable and
deterministic service is a crucial requirement for software programming, and a QoS
(Quality of Service) guarantee is even more important for real-time applications. Moreover,
due to the circuit switching, area- and power-consuming buffers are not needed in the
intra-cluster networks. However, circuit construction and destruction cause communication
latency in multi-hop networks. Fortunately, the intra-cluster network has a star topology, i.e. a
single hop. Thus the latency overhead is minimal. When there is no
transaction on a reserved circuit, the throughput of the channel can be severely
degraded. To prevent this defect of the circuit-switched network, the reserved
circuit is automatically disconnected after each packet transmission. Therefore another
processing unit can use the channel without waiting. In order to transmit multiple
packets, a burst transaction protocol is provided. During a burst transaction, the
channel is not released and not corrupted by other transactions. This protocol issue
will be discussed in Section 3.4 in more detail.
Inter-cluster traffic has longer end-to-end latency, and its volume is much
less than that of intra-cluster traffic. The peripheral cluster operates much slower than the main
cluster. If the inter-cluster network performed circuit switching, the inter-cluster channel
would show very low utilization due to the slow response of the peripheral units. Therefore,
packet switching, i.e. store-and-forward switching, is more appropriate for
inter-cluster networks. The main advantage of packet switching is that it permits statistical
multiplexing on the channel. That is, the packets from many different sources can
share the channel, allowing very efficient use of the fixed capacity. There are
packet buffers at both ends of the inter-cluster channel. The shift-register
type buffer of single-packet capacity takes 200 x 140 µm² of area and consumes 6.5mW at
1.6GHz.
3.2 Synchronization
A state-of-the-art or near-future SoC can be seen as a heterogeneous multi-processing
system with multiple timing references, because of the difficulty of global
synchronization as well as PU-independent power management techniques, i.e. dynamic
frequency scaling. Fig. 3.1 shows the proposed synchronization structure in a NoC.
Each processing unit operates with its own clock, but it communicates using a single
network clock, CLKNET. The Network Interface (NI) changes the timing reference from CKn to
CLKNET and vice versa. It is possible to synchronize CLKNET inside a cluster
because the physical area of a cluster can be within a single clock domain. However, it is
very difficult to synchronize CLKNET across all clusters. Therefore inter-cluster
communication becomes mesochronous (same frequency but different skew),
and a synchronization scheme is needed for reliable transmission. In this
implementation, the packet buffers between switches naturally play the role of
FIFO synchronization circuits. I also use a source-synchronous scheme where a strobe
signal travels along with the packet data. The strobe signal is used as the timing reference to
latch the packet data at the receiver end.
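[Figure: processing units PU1 to PU3 run on their own clocks CK1 to CK3 and attach through network interfaces (NI) to switches (SW) clocked by CLKNET. Cluster A and Cluster B use intra-cluster circuit switching and are joined by inter-cluster packet switching; CLKNET is not synchronized between clusters. The packet (header, address, data) travels together with a STROBE signal.]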
Fig. 3.1: Synchronization structure in a NoC.
3.3 Serialization
On-chip serial communication has many advantages over multi-bit parallel
communication in challenging issues such as signal skew, crosstalk, area cost and
wiring difficulty [9], [13]. In this implementation, a packet of 80 bits (16-bit header,
32-bit address and 32-bit data) is serialized onto 8 bits by SERDES circuits inside the
NI. The serialization method is different from the previous work [9], as shown in Fig.
3.2. In the previous implementation, header, address and data have their own channels
of 4 bits, 8 bits and 8 bits, respectively, at 800MHz. In this implementation,
however, they are multiplexed onto an 8-bit channel in a time-sharing manner at
1.6GHz. Therefore the area of the network is further reduced by 1/4 due to
the smaller channel width. In terms of network bandwidth, our serialization style
also has an advantage over the previous style. For example, in a read operation, the
request packet does not have to contain a data field and the response packet does not need
an address field. While the previous serialization method always has a fixed
packet length, this implementation has a shorter and variable packet length because
unnecessary fields are removed. As a result, the utilization of a shared channel can
increase further. In order to indicate the packet length at line speed, a sideband
End-of-Packet (EOP) signal is needed.
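The effect of dropping unused fields can be made concrete with a small calculation. The sketch below uses the field widths, channel width and clock rate stated above (16-bit header, 32-bit address, 32-bit data, 8-bit channel, 1.6GHz); the function names are illustrative, and the model ignores any protocol overhead beyond these fields.

```python
import math

HEADER_BITS = 16
ADDRESS_BITS = 32
DATA_BITS = 32
CHANNEL_BITS = 8     # width of the serialized channel
CLOCK_HZ = 1.6e9     # channel clock frequency


def packet_cycles(has_address, has_data):
    """Number of channel cycles needed to send one packet."""
    bits = HEADER_BITS
    if has_address:
        bits += ADDRESS_BITS
    if has_data:
        bits += DATA_BITS
    return math.ceil(bits / CHANNEL_BITS)


def packet_time_ns(has_address, has_data):
    """Channel occupancy of one packet in nanoseconds."""
    return packet_cycles(has_address, has_data) / CLOCK_HZ * 1e9


if __name__ == "__main__":
    print("write request :", packet_cycles(True, True), "cycles")
    print("read request  :", packet_cycles(True, False), "cycles")
    print("read response :", packet_cycles(False, True), "cycles")
```

Under this model a read request or read response occupies the channel for 6 cycles instead of the 10 cycles of a full 80-bit packet, which is where the utilization gain of the variable-length format comes from.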
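[Figure: channel width versus packet length in time. Previous work: header, address and data on parallel channels, 20 bits wide in total. This work: header, address and data time-multiplexed on one 8-bit channel.]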
Fig. 3.2: Two serialization methods.
3.4 NoC Protocol
Fig. 3.3 shows the NoC protocol: the packet format and packet transactions. The
NoC protocol supports burst packet operations for large data transactions with lengths
of 2, 4 or 8 packets. A burst transaction is not interleaved with packets of other flows.
A burst read request packet contains only the base address, but its response contains
successive data from the base address, incremented by 4, as shown in Fig. 3.3(b). The
first packet has full information, but the following packets carry only the minimum
information required for routing in their compact header. A burst write request of length
4 sends the base address only in the first packet, as shown in Fig. 3.3(c). The
following packets have only a compact header and a data field. By using the burst
transaction with compact packets, the total transaction time is reduced by half compared
with multiple single-packet transactions.
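The address sequence implied by a burst can be written down directly: only the base address is transmitted, and the receiver regenerates the remaining addresses by incrementing by 4. The helper below is an illustrative sketch, not part of the protocol definition; its name and the validation of the burst length against the BL values (1, 2, 4, 8) are assumptions.

```python
VALID_BURST_LENGTHS = (1, 2, 4, 8)


def burst_addresses(base_address, burst_length, step=4):
    """Addresses covered by a burst that transmits only its base address."""
    if burst_length not in VALID_BURST_LENGTHS:
        raise ValueError("burst length must be 1, 2, 4 or 8")
    return [base_address + step * i for i in range(burst_length)]


if __name__ == "__main__":
    print([hex(a) for a in burst_addresses(0x1000, 4)])
```

For a burst of length 4 starting at address A, the helper yields A, A+4, A+8 and A+12, matching the read transaction of Fig. 3.3(b).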
[Figure, panel (a) Packet format: a header followed by Address (32b) and Data (32b). Header fields:
C: Compact packet indicator
E: Bus encoding indicator
Dest. ID: Destination Network ID
Source ID: Source Network ID
Pr: Packet priority
W: Write/Read
Rv: Reserved
A: Address field enable
D: Data field enable
AC: Acknowledge request
BL: Burst length (1, 2, 4, 8)
Panel (b) Burst read transaction with BL=4: the master sends a header and address (HA); the slave returns a header (H) and four data words (D) for addresses A, A+4, A+8 and A+12.
Panel (c) Burst write transaction with AC request (BL=4, AC=1): the master sends a header and address (HA) followed by four data words (D) in compact packets; the slave returns an acknowledge packet (AC).]
Fig. 3.3: Packet format.
A switch arbiter peeks at the destination ID in the header field and looks up the
output port number in a look-up table. The 1-bit priority information in the header enables
differentiated scheduling among packets. The priority can be determined by software, i.e. the
Operating System (OS) or application programs. For more reliable transactions, an
acknowledgement request is possible, as shown in Fig. 3.3(c). The protocol provides a
1-bit sideband back-pressure signal for congestion control in the network. The
back-pressure signal is asserted when a packet buffer exceeds a predetermined threshold or
the destination PU temporarily cannot service requests.
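A behavioural sketch of these three mechanisms (table look-up, priority-aware arbitration and back-pressure) is given below. It is not the implemented arbiter: the tie-breaking rule (lowest input port wins among equal priorities), the function names and the example table contents are all assumptions made for illustration.

```python
def output_port(dest_id, lookup_table):
    """Map the destination ID found in the header to an output port."""
    return lookup_table[dest_id]


def arbitrate(requests):
    """Pick the winning input among (input_port, priority_bit) requests.

    A set priority bit wins; ties go to the lowest input port (assumed rule).
    """
    return min(requests, key=lambda req: (-req[1], req[0]))[0]


def back_pressure(buffer_occupancy, threshold, destination_ready):
    """Assert back-pressure when the buffer is too full or the PU is busy."""
    return buffer_occupancy > threshold or not destination_ready


if __name__ == "__main__":
    table = {0: 2, 1: 0, 2: 1}              # hypothetical destination-ID -> port map
    print(output_port(1, table))
    print(arbitrate([(0, 0), (2, 1), (3, 1)]))
    print(back_pressure(5, threshold=4, destination_ready=True))
```

A single priority bit is enough to separate two traffic classes; finer-grained scheduling would need more header bits than the format described here provides.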
The NoC protocol, named BONE, was upgraded from previous versions. The
completeness and reliability of the protocol functionality was verified by high-level C-
based simulator, BONE-SIM [24].
3.5 Queuing Buffer and Memory Design
A queuing buffer is used at the input port of a switch and in the network
interface as well. The queuing buffer consumes the most area and power of all the
building blocks in the on-chip network. The buffer circuits can be designed
with two different memory units: registers (flip-flops) or static-RAM cells. Fig.
3.4 shows four different register designs: (a) a conventional Shift-Register, (b) a Push-In
Shift-Out register, (c) a Push-In Bus-Out register and (d) a Push-In Mux-Out register. An
SRAM-style design is also shown in Fig. 3.5.
[Figure: four register-based queue designs built from D flip-flops with Packet In, Packet Out, Wreq and Rreq signals. (a) Conventional Shift Register: all cells shift (sh) and intermediate empty bubbles can occur. (b) Push-In Shift-Out Register: a controller enables (en) only the occupied cells and a new arrival is pushed in behind them. (c) Push-In Bus-Out Register: a controller drives the first-in cell onto a shared output bus. (d) Push-In Mux-Out Register: a read pointer selects the first-in cell through an output multiplexer.]
Figure 3.4: Register designs for queuing buffer.
Figure 3.5: Dual-port SRAM design for queuing buffer.
In a conventional Shift-Register, intermediate empty cells can exist whenever the
packet in/out rates differ temporarily. Shifting all the registers at
every packet-out consumes a huge amount of power. Furthermore, the minimum latency in a
queue is as long as the physical queue length rather than the backlog. Although this
design is the simplest, it is not desirable to implement on a chip due to its longer
latency and unnecessary power dissipation.
To remove the intermediate empty bubbles, an arriving packet can be stored at
the frontmost empty place rather than at the tail of the queue. This input style is called
'Push-In', as illustrated in Fig. 3.4(b). It removes the unnecessary latency and power
consumption caused by the empty bubbles. Only the occupied register cells are enabled.
However, the shifting register style still consumes unnecessary power by shifting all of
the occupied cells at every output packet. To avoid the shifting operation, the outputs
of all registers are tied to a shared output bus line via tri-state buffers, as shown in Fig.
3.4(c). The register holding the first-in packet is connected to the output bus by
turning on its tri-state buffer. In this design, only the cell in which a newly arrived packet
is stored is enabled. As the queuing capacity increases, the capacitance of the shared
bus wire increases as well because of the parasitic capacitance of the tri-state buffers, and
the delay and power consumption also grow. To eliminate this effect, output
multiplexers can be used, as shown in Fig. 3.4(d).
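The difference in register activity between the shift-out and mux-out styles can be illustrated with a behavioural model that counts how many register cells are written. This is a sketch under simplifying assumptions: one register write is counted per cell that latches a packet, the controller and multiplexer costs are ignored, and the class names are hypothetical.

```python
class PushInShiftOutQueue:
    """Packets are pushed into the first empty cell and shifted out at the head."""

    def __init__(self, depth):
        self.depth = depth
        self.cells = []            # cells[0] is the head of the queue
        self.register_writes = 0

    def push(self, packet):
        if len(self.cells) == self.depth:
            raise OverflowError("queue full")
        self.cells.append(packet)  # only the first empty cell is enabled
        self.register_writes += 1

    def pop(self):
        packet = self.cells.pop(0)
        # Every remaining occupied cell shifts one place towards the head.
        self.register_writes += len(self.cells)
        return packet


class PushInMuxOutQueue:
    """Packets stay in place; a read pointer selects the first-in cell."""

    def __init__(self, depth):
        self.slots = [None] * depth
        self.read_ptr = 0
        self.count = 0
        self.register_writes = 0

    def push(self, packet):
        if self.count == len(self.slots):
            raise OverflowError("queue full")
        write_idx = (self.read_ptr + self.count) % len(self.slots)
        self.slots[write_idx] = packet   # a single cell is enabled
        self.count += 1
        self.register_writes += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("queue empty")
        packet = self.slots[self.read_ptr]  # selected by the output multiplexer
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return packet
```

Both models deliver packets in first-in first-out order, but the shift-out queue rewrites every occupied cell on each read, whereas the mux-out queue writes each packet exactly once.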
These register-based implementations have a definite limitation in their capacity
because of area and power constraints. As the queuing capacity rises to a dozen
packets, the register-based implementation is poor in terms of both area and
power [29]. Therefore, for large-capacity queuing, the queuing buffer should be based on
dual-port SRAM cells, as shown in Fig. 3.5. The figure also shows the cell
circuit and its layout. An SRAM cell occupies only a tenth of the area of a register (flip-flop).
This area ratio will not vary much with technology.
Chapter 4 Low-Power Techniques
In this implementation, I propose various low-power techniques in the physical layer,
data-link layer, network layer and transport layer. In this chapter, those techniques will
be presented briefly.
4.1 Low-swing signaling
The global links connecting clusters are usually a few millimeters long. Therefore, they
suffer from longer latency and higher power consumption than local links do,
making cross-chip communication increasingly expensive. Low-swing signaling can
reduce the energy consumption significantly, and overdriving signaling improves the
delay [14].
4.1.1 Driver circuits
There is a limited degree of freedom in the design of drivers. If a reduced supply
voltage, VDDL, is used, the power dissipation is reduced to (VDDL/VDD)² of the full-swing
value [Fig. 4.1(a) and (b)]. VDDL can be provided from off-chip or generated on-chip by DC-DC
conversion from the main supply, VDD. If no additional power supply is available,
the low swing can be obtained by exploiting a transistor Vth drop or pulse-enabled
driving [Fig. 4.1(c) and (d)]. However, these designs are susceptible to process
variation and noise. In the pulse-controlled driver, Vswing is determined not only by
the pulse duration but also by Cw, which is hard to estimate at the design stage. The
most widely used driver is type (b) [14], [11], [32]. By using an NMOS pull-up
transistor instead of a PMOS transistor, a faster rise time on the output wire is
obtained with a smaller transistor size.
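As a worked example of the quadratic saving with a reduced driver supply, take VDD = 1.6V (the supply used elsewhere in this work) and an assumed VDDL of 0.4V; the 0.4V value is chosen only for the arithmetic.

```latex
\frac{P_{low}}{P_{full}} = \left(\frac{V_{DDL}}{V_{DD}}\right)^{2} = \left(\frac{0.4}{1.6}\right)^{2} = \frac{1}{16} \approx 0.063
```

Under this assumption, the driver dissipates roughly one sixteenth of the full-swing power, before accounting for the extra energy the receiver spends restoring the signal.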
Fig. 4.1: Driver Circuits for Low-Swing Signaling.
4.1.2 Receiver circuits
There are two design options for receivers with regard to noise immunity: a single-ended
level-converter [Moisiadis00], [40], [41] or a differential amplifier [14], [42],
[43], [11]. Differential signaling is more immune to noise due to its high common-mode
rejection, allowing a further reduction in the swing voltage [32], [14]. Although
differential signaling requires twice as many wires, the wiring congestion can be
alleviated by using on-chip serial links [44]. Zhang evaluated many different receivers,
including a pseudo-differential amplifier, with respect to energy, delay, swing and SNR
[32].
Fig. 4.2: Clocked differential sense-amplifier.
In the on-chip interconnection network, the receiver circuits should be light-weight
and occupy as small an area as possible because they are used abundantly in most of the
network interfaces. Fig. 4.2 shows an example of a simple clocked sense-amplifier
with a three-stage CMOS inverter chain. PMOS transistors are used as the input gates
in order to receive a low common-mode input signal. The sizes of the input gates and
their bias currents are chosen to amplify the desired low-swing differential input
(WP/LP = 3µm/0.18µm, Vswing = 0.2V, VDD = 1.6V, Area = 10 x 15 µm² [11]).
4.1.3 Static and Dynamic Wires
There are two kinds of wires: static and dynamic. To speed up the response of
a dynamic wire, it is precharged to VDDL through PMOS transistors before each bit transition.
After the precharging phase, the wire is conditionally discharged by the pull-down
transistor of a driver. This dynamic signaling is used for multidrop buses having large
fan-in and large fan-out. In the network on chip, however, the link is point-to-point, so
there is only a single fan-in and fan-out. Therefore the dynamic wire is not a good
candidate for on-chip networks, especially when the wire has long latency.
Furthermore, it is susceptible to noise.
4.1.4 Implemented Low-swing Link
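[Figure: the transmitter (TX) drives PACKET_OUT and STROBE_OUT over 5.2mm differential wires laid out in zigzags. Each driver is a pre-driver followed by an NMOS driver (10/0.18) supplied by VSWING < VDD. The receiver (RX) uses clocked sense amplifiers (input gates 3/0.18) followed by inverter amplifiers to restore PACKET_IN and STROBE_IN to VDD. A clock restore circuit (CRC) with transistors P1 and P2 regenerates the sense-amplifier clock from STB and nSTB.]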
Fig. 4.3: Low-swing signaling and its transceiver circuits.
Figure 4.3 shows the implemented low-swing differential signaling. The global
wires between the main cluster and the peripheral cluster are laid out in zigzags to emulate a
global link as long as 5.2mm without repeaters. The transmitter drives the wires using
VSWING, which is less than VDD, and the receiver restores the swing to the normal voltage, VDD. The
driver uses N-FETs for both pull-up and pull-down, instead of an inverter, to exploit their
lower linear resistance at small Vds. Because the gate voltage of the pull-up NMOS
transistor is higher than its drain voltage, no Vth drop is observed on the wire. In
addition, since the NMOS transistor is smaller than the equivalent PMOS transistor
by 2/5, the capacitive load of the pre-driver is reduced.
A simple clocked sense amplifier followed by a 3-stage inverter chain performs
full logic amplification of the low-swing signal. PMOS input gates are used in order to receive a
low common-mode input signal. The sizes of the input gates and the bias current are
chosen to amplify a swing as low as 200mV to the 1.6V full-logic swing with a small delay
penalty. A clock signal for the clocked sense amplifier is regenerated from the STB
signal by a clock restore circuit (CRC). In the CRC, an inverted input, nSTB, is fed to
the P1 gate in order to reduce the standby current. In standby (STB=0 or nSTB=1),
the gate voltage of the P1 transistor rises as high as VSWING, thus the bias current
decreases. Due to this scheme, the standby bias current of the CRC becomes almost
zero.
[Figure: (a) energy (pJ/bit) of the transmitter and the receiver and (b) energy x delay
(pJ/bit x nsec), each versus driving voltage VSWING (volts) from 0.3 to 1.1, at
400Mbps, 800Mbps and 1.6Gbps; the optimal point is marked in (b).]
Fig. 4.4: (a) Energy consumption, (b) energy and delay product versus voltage
swing at various signal rates.
C. Svensson demonstrated the existence of an optimum voltage swing for the
lowest energy dissipation in long-wire signaling [15]. In our implementation,
VSWING scales to the most energy- and performance-efficient voltage swing at each
operating frequency of the on-chip network. To find the optimum voltage swing, at
which the energy-delay product is smallest, I conducted post-layout simulations
with a precise capacitance and resistance wire model. A 5mm metal2 wire
of 0.5um width and 1.1um spacing has 330fF parasitic and 100fF coupling capacitance.
VSWING is swept from 0.25V to 1.1V in 50mV steps, and the signaling rates are 400Mbps,
800Mbps, and 1.6Gbps. As shown in Fig. 4.4(a), the energy required at the transmitter
to create a certain voltage swing on the wires decreases linearly with decreasing
swing, whereas the energy to amplify this signal back to a normal logic swing
increases superlinearly with decreasing swing. The optimum VSWING exists because of
these opposite trends at the transmitter and the receiver.
According to the result, less energy is needed at a higher signal rate, for the
following reasons. At the 1.6Gbps signal rate, the wire does not swing fully up to the
driving voltage, because the sum of the rise and fall times is longer than the signal
period. Therefore, the energy needed at the transmitter decreases as the signaling
rate increases. At the receiver amplifier, a constant bias current is consumed while
the clock is HIGH. Since the absolute time during which the clock is HIGH gets shorter
as the signal rate increases, the amplifier consumes less energy at higher frequency.
Fig. 4.4(b) shows the energy-delay product versus VSWING. The delay from a
transmitter to a receiver is about 0.9nsec, and its variation with VSWING or signal
rate is as small as ±40psec. As shown in the figure, the optimal swing
voltage is 0.45V, 0.40V, and 0.30V at the 400Mbps, 800Mbps and 1.6Gbps signal rates,
respectively. In each operating mode, the driving voltage scales to the optimal voltage
obtained above. Due to the low-swing signaling, the power dissipation on the global
link is reduced to 1/3 of that on a full-swing repeated link. In addition, there are
no area-consuming repeaters on the wires.
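The swing selection described above can be sketched in a few lines of Python. This is only an illustration: the names SWEEP_MV, OPTIMAL_SWING_MV, select_vswing_mv and argmin_edp are hypothetical, and the sketch holds only the sweep grid and the three optima reported in the text, not the simulation itself.

```python
# Sweep grid in millivolts (0.25 V to 1.1 V in 50 mV steps); integers avoid
# floating-point rounding on the grid points.
SWEEP_MV = list(range(250, 1101, 50))

# Optimal VSWING reported in the text, keyed by signaling rate in Mbps.
OPTIMAL_SWING_MV = {400: 450, 800: 400, 1600: 300}


def select_vswing_mv(rate_mbps):
    """Return the driving voltage (mV) for one of the three signaling rates."""
    try:
        return OPTIMAL_SWING_MV[rate_mbps]
    except KeyError:
        raise ValueError(f"unsupported signaling rate: {rate_mbps} Mbps")


def argmin_edp(energy_pj, delay_ns):
    """Return the sweep voltage (mV) with the smallest energy x delay product.

    energy_pj and delay_ns hold one simulated value per entry of SWEEP_MV.
    """
    if not (len(energy_pj) == len(delay_ns) == len(SWEEP_MV)):
        raise ValueError("expected one value per sweep point")
    products = [e * d for e, d in zip(energy_pj, delay_ns)]
    return SWEEP_MV[products.index(min(products))]
```

Given post-layout energy and delay values for each sweep point, argmin_edp would return the grid voltage that Fig. 4.4(b) marks as the optimal point for that signaling rate.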
4.2 Mux-Tree based Round-Robin Scheduler
4.2.1 Switch Scheduler
To arbitrate output conflicts, a scheduler is used on each output port. The
arbitration scheduling adds to the latency, power and area of the switch design. The
latency of the arbiter becomes larger than that of the switch fabric as the switch size
grows beyond 16x16 [44]. Furthermore, the area cost is not negligible when a
serialized link is used [11]. As shown in Fig. 4.5, the scheduler occupies an area
comparable to that of the switch fabric when the phit width of a port is 10 bits.
Therefore, the scheduler design is as important as the switch fabric design.
[Figure: switch layout showing the scheduler (70x80um2) and the 6x6 cross-point switch
fabric (220x220um2).]
Figure 4.5: A 6x6 switch layout (phit width of a link: 10 bits).
The round-robin scheduling algorithm is the most widely used in on-chip networks [11]
because of its fairness and freedom from starvation. The round-robin scheduler can be
implemented by using two priority encoders [17] or Mux-Tree-connected logic [46]. A
pseudo-LRU algorithm and its implementation were also proposed to achieve lower area
and lower latency than the round-robin algorithm [47], [48].
4.2.2 Mux-Tree based Round-Robin Scheduler
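[Figure: (a) Mux-Tree based implementation: the REQ<7:0> and token<7:0> vectors feed a
binary tree of n-1 Tiny Arbiters (TA); a log n thermo-encoder generates the token
vector from the pointer, and the tree levels produce grant<0>, grant<1:0> and
grant<2:0>. (b) Tiny Arbiter (TA): inputs U.req, U.token, L.req, L.token; outputs
Q.req, Q.token, UorL.]
TA selection table (RQ = request, TK = token, X = don't care; UorL: U=1, L=0; U>L on ties)
Upper RQ  Upper TK  Lower RQ  Lower TK  UorL
0         X         0         X         X
1         X         0         X         U
0         X         1         X         L
1         1         1         0         U
1         0         1         1         L
1         1         1         1         U
1         0         1         0         U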
Fig. 4.6: Mux-Tree based Round-Robin Scheduler: (a) block diagram, (b) Tiny Arbiter.
A scheduler (or arbiter) is needed in a crossbar switch when two or more input
packets from different input ports are destined for the same output port at the same
time. Among the many scheduling algorithms, the round-robin algorithm is the most
widely used in ATM switches and on-chip networks because it is fair and lightweight
[16], [17]. There are many ways to implement the round-robin algorithm [16]-[19].
Fig. 4.6 shows a Mux-Tree based implementation whose structure is
highly modular and scalable. The scheduling latency is O(log n) and the required
resources are O(n), where n is the number of ports in the crossbar switch.
The round-robin scheduler has a rotating pointer that indicates the most recently
granted port. The port next to the pointer has the highest priority to be granted. For
example, consider the request vector <7:0> = 01100010 with the pointer at port <5>.
Then port <4> has the highest priority, and the lower group of ports <4:0> has higher
priority than the upper group of ports <7:5>. This information is given by a thermo-
encoder whose output becomes token <7:0> = 00011111; that is, ports <4:0> hold
tokens. A request from a port holding a token ('1') has higher priority than a request
from a port without a token. These request and token vectors are the inputs of the
binary Mux-Tree, which has a Tiny Arbiter (TA) at each node. Each TA selects one of two
ports, the upper one or the lower one, based on the table shown in Fig. 4.6(b). When
both requests have tokens, the TA selects the upper port because the pointer rotates in
decreasing order. The UorL, Q.req and Q.token bits generated at each TA propagate to
its parent node as inputs. Then one of the two children's UorL bits is selected by a
2:1 MUX based on their parent's UorL bit. The selected child-UorL bit and its
parent-UorL bit are concatenated and propagate again. By doing so successively up to
the root node, the granted port number is finally determined.
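The arbitration just described can be modeled behaviorally in Python. This is a sketch under stated assumptions, not the implemented circuit: the function names are hypothetical, the pointer is taken to be the most recently granted port so that every port below it holds a token, and the number of ports is assumed to be a power of two.

```python
def thermo_encode(pointer, n):
    """Token vector indexed by port: ports below the pointer hold a token."""
    return [1 if port < pointer else 0 for port in range(n)]


def tiny_arbiter(upper, lower, level):
    """Merge two child nodes (req, token, grant) into their parent node.

    A requester with a token beats one without; ties go to the upper child.
    The UorL bit (U=1, L=0) is concatenated above the child's grant bits.
    """
    u_req, u_tok, _u_grant = upper
    l_req, l_tok, _l_grant = lower
    if u_req and l_req:
        sel_upper = bool(u_tok) or not l_tok
    else:
        sel_upper = bool(u_req)  # don't care when neither side requests
    _req, tok, grant = upper if sel_upper else lower
    return (u_req | l_req, tok, (int(sel_upper) << level) | grant)


def mux_tree_grant(requests, pointer):
    """Return the granted port number, or None when nobody requests."""
    n = len(requests)  # assumed to be a power of two
    tokens = thermo_encode(pointer, n)
    nodes = [(requests[p], tokens[p], 0) for p in range(n)]
    level = 0
    while len(nodes) > 1:
        nodes = [tiny_arbiter(nodes[i + 1], nodes[i], level)
                 for i in range(0, len(nodes), 2)]
        level += 1
    req, _tok, grant = nodes[0]
    return grant if req else None


def bits_to_ports(vector):
    """Convert a '<7:0>' style bit string into a list indexed by port."""
    return [int(c) for c in reversed(vector)]


# Example from the text: requests 01100010, pointer at port 5.
print(mux_tree_grant(bits_to_ports("01100010"), pointer=5))
```

For the example vector, the only requester holding a token is port 1, so the sketch grants port 1. The tree has log2(n) levels and n-1 arbiter nodes, which matches the O(log n) latency and O(n) resources stated above.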
The Mux-Tree based implementation is compared with four other designs, EXH,
SHFT_ENC, RIPPLE, and DUAL_SPE, presented in [17]. Fig. 4.7 shows a comparison of the
power consumption and scheduling delay simulated with 8 input ports in a 0.18um
process technology. This work, Mux-Tree, achieves the minimum power-delay product:
136uW and 1.05nsec at 100MHz and an offered load of 50%.
This work also requires the smallest number of transistors, i.e. area, of all the
designs except RIPPLE, as shown in Table 4.1.
[Figure: power [uW] (0 to 400) versus delay [nsec] (0 to 4) for EXH, Dual_SPE,
SHFT_ENC, RIPPLE and this work.]
Fig. 4.7: Power and delay comparison with other round-robin implementations [17].
TABLE 4.1 COMPARISON OF THE NUMBER OF REQUIRED TRANSISTORS
Design       8 ports    16 ports
RIPPLE       403        927
EXH          1435       6879
SHFT_ENC     629        1711
DUAL_SPE     573        1483
This work    569        1203
4.3 Crossbar Partial Activation Technique
4.3.1 Switch Fabric Design
[Figure: (a) switch structure, (b) cross-point switch fabric, (c) Mux-based switch
fabric built from 4:1 MUXes.]
Fig. 4.8: (a) switch structure, (b) cross-point and (c) Mux-based switch fabric.
The conventional switch consists of an Input Queue (IQ), a scheduler, a switch
fabric and an Output Queue (OQ), as shown in Fig. 4.8(a). There are two kinds of switch
fabric design: cross-point and Mux-based, as presented in Fig. 4.8(b)
and (c), respectively. The cross-point switch has a pass-transistor at each crossing
junction of the input and output wires. In this switch fabric, the capacitive load
driven by an input driver is the junction capacitance of the pass-transistors on the
input and output wires plus the wire capacitance itself. The voltage swing on the
output wire is reduced to VDD-Vth_N because of the threshold voltage drop of the NMOS
pass-transistor, and thus the power dissipation is reduced. However, this design is
hard to synthesize. The fabric area is determined by the wiring area and not by the
transistors, so its area cost can be minimal. The Mux-based switch uses a multiplexer
for each output port. The capacitive load driven by an input driver is the input gate
capacitance of the multiplexers plus the input wire capacitance.
Table 4.2 Comparison of two designs of the switch fabric: power, delay and area.
               Power [mW]               Delay [psec]             Area [mm2]
Switch size    Cross-point  Mux-based   Cross-point  Mux-based   Cross-point  Mux-based
4x4            7.7          12.4        300          370         0.038        0.059
8x8            23.2         52.2        460          580         0.154        0.235
16x16          76.8         217.2       740          1000        0.614        0.941
Table 4.2 presents the power, speed and area comparison of the two designs.
It is a simulation result using a capacitive wire model extracted from the physical
layout. The power consumption of the cross-point switch is much lower than that of the
Mux-based design: 38%, 56% and 65% lower for 4x4, 8x8 and 16x16 switches,
respectively. Furthermore, the delay of the Mux-based switch is longer than that of the
cross-point switch because of the multiplexer gate delay. The cross-point switch
occupies about 65% of the area of the Mux-based switch.
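The percentages quoted from Table 4.2 can be recomputed directly from the table entries. The short Python sketch below does so; the dictionary layout and the function names are only illustrative.

```python
# Table 4.2 entries as (cross-point, Mux-based) pairs.
TABLE_4_2 = {
    "4x4":   {"power_mw": (7.7, 12.4),   "area_mm2": (0.038, 0.059)},
    "8x8":   {"power_mw": (23.2, 52.2),  "area_mm2": (0.154, 0.235)},
    "16x16": {"power_mw": (76.8, 217.2), "area_mm2": (0.614, 0.941)},
}


def saving_percent(cross_point, mux_based):
    """Reduction of the cross-point value relative to the Mux-based value."""
    return 100.0 * (1.0 - cross_point / mux_based)


def ratio_percent(cross_point, mux_based):
    """Cross-point value expressed as a percentage of the Mux-based value."""
    return 100.0 * cross_point / mux_based


for size, row in TABLE_4_2.items():
    power = saving_percent(*row["power_mw"])
    area = ratio_percent(*row["area_mm2"])
    print(f"{size}: power saving {power:.1f}%, area ratio {area:.1f}%")
```

The power saving grows with the switch size, while the area ratio stays close to 65% for all three sizes.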
4.3.2 Crossbar Partial Activation Technique
[Figure: (a) conventional n x n crossbar with input drivers, output drivers and eight
schedulers (sc0 to sc7); (b) proposed crossbar with partial activation, showing the
main RB, sub-RBs and sub-CBs, with channel depth k.]
Fig. 4.9: Schematic diagram of (a) an 8 x 8 conventional crossbar and (b) a proposed crossbar with partial activation technique.
A conventional X-Y based crossbar fabric is shown in Fig. 4.9(a). An n x n
crossbar fabric comprises n^2 crossing junctions, which contain NMOS pass-transistors.
An NMOS-only pass-transistor is used rather than a CMOS transmission gate in order to
reduce the voltage swing to VDD-Vth and also to reduce the gate loading. In the
conventional crossbar fabric, however, each input driver wastes power charging and
discharging two long wires, the Row-Bar (RB) and the Column-Bar (CB), and 2n
transistor junction capacitors. The RB and CB should be laid out in lower metal layers,
M1 or M2, in order to reduce the fabric area and to minimize the via resistance.
Therefore, the load on the driver output becomes significant as the number of ports
increases.
Fig. 4.9(b) shows the proposed crossbar switch with the Crossbar Partial Activation
Technique (CPAT). By splitting the n x n fabric into 4x4 fabrics (or tiles), the
activated capacitive load is reduced by half. A gated input driver at each tile
activates its sub-RB only when its tile gets a grant from the scheduler. Only four
additional four-input OR gates are needed in each tile, regardless of the depth of a
channel, k. The output line, CB, is also divided into two sub-CBs to prevent signal
propagation into other tiles. A 2:1 MUX connects one of the two sub-CBs to the output
port according to the grant signals from its scheduler.
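The gating logic of CPAT can be illustrated with a small behavioral model in Python. It is a sketch under assumptions, not the circuit itself: the function name cpat_controls and the data layout are hypothetical, and the grants are modeled as a list that maps each output port to the input port it has granted.

```python
TILE = 4  # tile size used in the text (4x4 tiles)


def cpat_controls(grants, n=8):
    """Derive CPAT control signals from the scheduler grants.

    grants[o] is the input port granted to output port o, or None.
    Returns (sub_rb_enable, sub_cb_select):
      sub_rb_enable[i][c] is True when input i must drive its sub-RB in
        column tile c (the OR of the grants from that tile's four outputs);
      sub_cb_select[o] is the row tile whose sub-CB the output MUX selects.
    """
    tiles = n // TILE
    sub_rb_enable = [[False] * tiles for _ in range(n)]
    sub_cb_select = [None] * n
    for out_port, in_port in enumerate(grants):
        if in_port is None:
            continue
        # Setting the flag for any granting output in the tile acts as the OR.
        sub_rb_enable[in_port][out_port // TILE] = True
        sub_cb_select[out_port] = in_port // TILE
    return sub_rb_enable, sub_cb_select


# Input 1 is granted output 6, and input 5 is granted output 2.
grants = [None] * 8
grants[6] = 1
grants[2] = 5
enable, select = cpat_controls(grants)
print(enable[1], enable[5], select[6], select[2])
```

In this example each active input drives only one of its two sub-RBs, which is the source of the halved capacitive load on the row and column bars.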
[Figure: power consumption [mW] of the conventional crossbar and the crossbar with CPAT,
and power reduction [%], versus offered load (10%, 30%, 50%, 70%, 90%); the power
reduction labels are 8, 10, 16, 19 and 22%.]
Fig. 4.10: Power comparison of an 8x8 crossbar fabric with and without the crossbar partial activation technique.
An 8x8 crossbar fabric with CPAT is compared with a conventional
one. In this crossbar design, RBs and CBs are laid out in the M2 and M1 layers,
respectively. The area of the fabric is about 240x240 um2. According to the
capacitance extraction from the layout, the capacitances of an RB and a CB to the
substrate are 44fF and 28fF, respectively, and the coupling capacitance between
adjacent bars is 13fF. Fig. 4.10 shows the comparative power consumption according to
the offered load. As the offered load increases, the power reduction becomes more
significant. At 90% offered load, a 22% power saving is obtained. CPAT cuts the
capacitive load on RBs and CBs by half. However, because the power consumption of the
output drivers and main RBs is not reduced, the power reduction does not exceed 25%.
The additional OR gates and MUXes consume less than 2% of the overall power. When CPAT
is applied to a 16x16 crossbar switch divided into 4x4 tiles, a 43% power saving is
obtained.
4.4 Low-Energy Coding on On-chip Serial Link
Much research has addressed bus coding for reducing the switching probability,
such as bus-invert (BI) coding [49], Gray code [50], T0 code [51], partial bus-invert
coding [52], probability-based mapping [53] and so on. However, there was also a
report on the ineffectiveness of those on-chip bus coding techniques, because the
power dissipation in the (de)coder is comparable to the power saving obtained by the
coding when the wire length is no longer than a few tens of mm [54]. Furthermore, those
bus coding schemes are effective for parallel buses but ineffective for the
multiplexed channels used in packet-switched networks.
As the link wires connecting processing units and switches are used abundantly,
wiring congestion will become one of the major challenges in network-on-chip
design. To alleviate the wiring congestion, a narrow channel [55] or an on-chip
serialized channel [9], [11] has been proposed. In serial communication, the wire
frequency, f, is multiplied by the serialization ratio to support the same bandwidth as
in parallel communication, but the number of wires, N, is divided by the serialization
ratio. Thus, the product of f and N is the same in serial and parallel communication
channels, and the serialization ratio can be determined by trading off f against N.
However, the switching activity factor of a serial wire, α, is different from that of
parallel wires, and the difference depends on the data patterns. Figure 4.11 shows an
example comparing the number of transitions in parallel and serial communication. In
this example, the 8bit parallel bus has 7 transitions. However, when the same data
stream is serialized onto a single wire, the number of signal transitions on the wire
increases to 31. If there is correlation between adjacent data words, some bits of the
parallel bus stay quiet without any transition. However, such correlation does not
help in serial communication because the data bits are multiplexed onto the single
wire. Therefore, the activity factor of the serial wire is statistically higher than
that of the parallel bus. In common multimedia applications, the most significant bits
tend to have high spatial and temporal correlations because of sign extension or the
locality characteristics of multimedia streams [20]. In these applications, serial
communication dissipates more energy than parallel communication.
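The transition counts in this example can be reproduced with a few lines of Python. The sketch assumes that each word is sent most significant bit first and that transitions are counted from the first transmitted bit onward; the function names are illustrative.

```python
# The five 8-bit data words W0..W4 of Fig. 4.11.
WORDS = [0b01010001, 0b01010010, 0b01010011, 0b01010100, 0b01010101]


def parallel_transitions(words):
    """Total number of bit flips on a parallel bus between successive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def serialize(words, width=8):
    """Flatten the words into one bit stream, most significant bit first."""
    return [(w >> i) & 1 for w in words for i in range(width - 1, -1, -1)]


def serial_transitions(bits):
    """Number of level changes between consecutive bits on a single wire."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


print(parallel_transitions(WORDS))           # parallel bus
print(serial_transitions(serialize(WORDS)))  # single serial wire
```

Under these assumptions the parallel bus sees 7 transitions and the serial wire sees 31, as in Fig. 4.11.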
[Figure: (a) 8bit parallel bus D7 to D0 carrying the words W0 to W4 = 01010001,
01010010, 01010011, 01010100, 01010101 (7 transitions); (b) the same words passed
through a serializer onto a single bit serial bus (31 transitions).]
Fig. 4.11: An example of a number of transitions with the same data pattern on (a) parallel wires and (b) a serial wire.
Many parallel bus coding methods have been proposed to reduce the switching
power on the address or data bus between a processor and memories. However,
such conventional parallel bus coding methods cannot be employed on a serial bus.
Therefore, I proposed a serialized low-energy transmission (SILENT) coding
technique to minimize the transmission energy on the serial wire [21]. In this coding,
only transitions are encoded as the symbol '1'; that is, the difference between
successive data words is encoded.
The encoder works as follows:
    B(t)[i] = b(t)[i] ⊕ b(t-1)[i],   i = 0, ..., n-1          (1)
where b(t)[n-1:0] is the n-bit data word from a sender at time t and
B(t)[n-1:0] is the n-bit encoded data word at time t.
By serializing the encoded data words, zeros appear on the wire more frequently
because of the correlation between the successive data words, b(t).
Figure 4.12 shows an example of the advantage of this coding method. All bits from
B[7] to B[3] become zeros after these data words are encoded because those bits do
not change with time. Serializing these encoded words reduces the number of
transitions on the serial wire, as shown in Figure 4.12(d), and the wire looks quiet
or silent. In this example, a conventional serial wire without the SILENT coding,
shown in Figure 4.12(c), has about three times as many transitions from t+1 to t+4.
By reducing the number of transitions on the serial wire, the transmission energy can
be saved proportionally.
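A minimal Python sketch of the SILENT encoder of equation (1), together with the matching decoder, is given below. It assumes that the word preceding the first one is all zeros, so the first word is sent unchanged as in Fig. 4.12(b), and that bits are serialized most significant bit first; the function names are illustrative.

```python
def silent_encode(words):
    """B(t) = b(t) XOR b(t-1), with b(-1) assumed to be all zeros."""
    encoded, prev = [], 0
    for word in words:
        encoded.append(word ^ prev)
        prev = word
    return encoded


def silent_decode(encoded):
    """b(t) = B(t) XOR b(t-1): accumulate the differences back into words."""
    words, prev = [], 0
    for code in encoded:
        prev ^= code
        words.append(prev)
    return words


def serial_transitions(words, width=8):
    """Level changes on one wire when the words are sent MSB first."""
    bits = [(w >> i) & 1 for w in words for i in range(width - 1, -1, -1)]
    return sum(a != b for a, b in zip(bits, bits[1:]))


data = [0b01010001, 0b01010010, 0b01010011, 0b01010100, 0b01010101]
coded = silent_encode(data)
print([format(c, "08b") for c in coded])
print(serial_transitions(data), serial_transitions(coded))
```

For the data words of Fig. 4.12 the encoded words are 01010001, 00000011, 00000001, 00000111 and 00000001, and the number of serial transitions drops from 31 to 13.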
[Figure: (a) data words from the sender at t to t+4: 01010001, 01010010, 01010011,
01010100, 01010101, with activity factor α = 2/8, 1/8, 3/8, 1/8 between successive
words; (b) encoded data words: 01010001, 00000011, 00000001, 00000111, 00000001;
(c) serial data without coding, # of transitions per word: 5, 7, 5, 7, 7;
(d) serial data with coding, # of transitions per word: 5, 2, 2, 2, 2.]
Fig. 4.12: (a) original data words, (b) encoded data words, (c) conventional serial data with 31 transitions and (d) encoded serial data with 13 transitions.
Figure 4.13 shows the circuit implementation of the SILENT codec; the bold line
indicates the critical path in each circuit. It requires a single gate delay for
encoding or decoding and an additional MUX delay for enable or disable control. The
power consumption for 32bit data word encoding and decoding is about 390µW and 385µW,
respectively, at 100MHz with the worst-case data pattern.
[Figure: (a) encoder with inputs b(t), b(t-1) and enable E and output B(t);
(b) decoder I and (c) decoder II with input D(t), previous word d(t-1), enable E and
output d(t).]
Fig. 4.13: Circuit implementation of (a) encoder and (b, c) decoders.
[Figure: average power consumption [mW] (0 to 12) versus # of transitions between
successive data words (0 to 32), without coding and with SILENT coding; the energy
saving ranges and the overhead range are marked.]
Fig. 4.14: Average power consumption on serial communications with and without SILENT coding.
In order to analyze the energy efficiency of this coding scheme, I evaluated the
energy consumption in a serial communication channel containing 32b encoders, a
32-to-4 serializer, 4bit 8mm serial wires with repeaters, a 4-to-32 deserializer, and
32b decoders. The energy consumption of the communication depends on the data
patterns to be sent, so I evaluated the power consumption with all possible variations
from a random data word. Figure 4.14 compares the average power
consumption of the serial communication with and without SILENT coding at a
100MHz operating frequency. The x-axis stands for the number of bits that differ
between successive 32bit data words, b(t). A 0 on the x-axis means that b(t) is the
same as b(t-1), and a 16 means that an arbitrary 16 bits among the 32 bits of b(t) have
changed from their previous values in b(t-1). As a result, the region below 12 or
above 21 on the x-axis is the energy saving region of the SILENT coding. However,
there is some power overhead, at most 14%, for random data transitions in the region
from 12 to 21. As shown here, the energy saving range is two times wider than the
overhead range, and the amount of power saved is much larger than the overhead.
Therefore, the SILENT coding has ample opportunity to save energy for most data
patterns.
To evaluate the performance of the proposed SILENT coding in a real application,
I traced the transactions of the on-chip traffic between a RISC processor and system
memories while a 3D graphics application was running [22]. The full 3D graphics
pipelines of geometry and rendering operations are executed for 3D scenes with 5878
triangles. Figure 4.16 shows the distribution of the displacement of the memory
address and data for successive memory accesses. The instruction memory address is so
sequential that 99.5% of the 6 million transactions are within the energy saving
region of the SILENT coding. Although the instruction codes are quite random, 60% of
them are within the energy saving region. In the case of data memory access, 79% and
70% of the 1.5 million data memory address and data transactions, respectively, are
within the energy saving region. With this memory access pattern, I evaluated the
energy consumption of the serial communication. Fig. 4.15 shows the resulting
normalized average energy consumption on the serial wire with and without SILENT
coding. The energy consumption with SILENT coding includes the energy dissipation in
the codec circuits. The SILENT coding shows the best performance for the instruction
address, about 77% energy saving. Even for random traffic, as in the case of the
instruction codes, 13% energy saving is achieved. It also saves 40 ~ 50% of the
transmission energy for multimedia data traffic. In conclusion, the SILENT coding
reduces the energy consumption of the serial communication for all kinds of on-chip
data traffic in the 3D graphics application.
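[Figure: normalized average energy consumption (0 to 1) without coding and with SILENT
coding for instruction address, instruction code, data memory address and data memory
data; the bars with SILENT coding are labeled 0.23, 0.87, 0.51 and 0.62.]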
Fig. 4.15: Normalized average energy consumption in each memory access
type.
[Figure: access counts versus # of transitions between successive data words (0 to 32)
for the 3D scene: (a) instruction address and instruction code, access counts up to 3M,
labeled "99% of address" and "60% of code"; (b) data memory address and data memory
data, access counts up to 400K, labeled "79% of address" and "70% of data".]
Fig. 4.16: Distribution of the displacement between successive (a) instruction and (b) data memory access.
Chapter 5 Implementation and Measurement Results
5.1 Design Flow and Methodology
The network-on-chip is designed with a semi-custom method. Integrated processors
are synthesized, memories are compiled by an SRAM compiler, and the on-chip networks
are fully custom designed for low power and high performance. Processors and
memories are obtained from vendors and reused by attaching the network interface
and wrappers. Fig. 5.1 shows the design flow: the EDA tools used, the process stages,
and the output at each stage. It took 6 months from architecture sketch to tape-out
with a team of 1 architect and circuit designer, 1 protocol designer and 4 layout
artists. One additional package-board designer provided support after the chip
fabrication.
[Figure: design flow chart with three columns (used EDA tools, process, outputs) and
iteration loops. Process stages: protocol define, architecture define, protocol
verification, architecture verification, logic design, circuit design, memory design,
synthesis, circuit simulation, SRAM compile, place & route, manual layout, assemble &
PnR, DRC & LVS, tape-out & fabrication (Dongbu-Anam 0.18um CMOS), test board/PKG
design, demonstration. Tools named: C language, C-based simulator, MS-VISIO,
Verilog-HDL, Verilog-XL, Cadence OPUS (Virtuoso), Synopsys Design Compiler and Astro,
SRAM compiler (Dongbu), EPIC, HSpice, Mentor Calibre, Cadence ORCAD, measurement
equipment and FPGA. Outputs: protocol spec. sheet, C-model, Verilog model, behavioral
netlist, gate-level netlist, block layout, full-chip layout, GDS-II, chip die, test
system, visual demo.]
Fig. 5.1: Chip and System Design Flow.
5.2 Chip Implementation and Measurements
Fig. 5.2: Block diagram of a prototype SoC for multimedia applications.
With the proposed NoC architecture, protocol and low-power techniques, I
implemented a multimedia SoC as a prototype. The block diagram is shown in Fig. 5.2.
The chip integrates two clusters: a main cluster and a peripheral cluster. The main
cluster contains two RISC processors, an on-chip FPGA, two 64kb SRAMs, and an
off-chip gateway. The two RISC processors emulate a multiprocessor system. The
off-chip gateway (OGW) [9] enables seamless off-chip communication with other NoCs on
the same package or board in order to compose larger scale systems. By using the
OGW, a PU on one die can communicate with PUs on other dies without protocol
conversion. The peripheral cluster contains three memories to emulate peripheral slave
units. I assumed that the peripheral cluster is located far from the main cluster to
emulate a large SoC; thus the two clusters are interconnected via a 5mm global link.
The global link uses low-swing signaling to reduce power consumption and
differential signaling for higher SNR. The PLL generates a 100MHz clock for the main
cluster PUs, a 50MHz clock for the peripheral cluster units, and a 1.6GHz network
clock for the switches and network interfaces. The clock frequencies are scalable for
the power management modes, i.e. 100/50/1600MHz for FAST mode, 50/25/800MHz for
NORMAL mode, and 25/12.5/400MHz for SLOW mode. The PU clocks are not
synchronized with each other, in order to emulate systems with multiple timing
references. Therefore, no effort is needed for clock skew minimization. The on-chip
network supports 3.2GB/s communication bandwidth for each PU and 11.2GB/s aggregate
bandwidth in FAST mode. The chip is implemented in a 0.18µm CMOS process with
6 Al metal layers, and its die area is 5x5mm2. Fig. 5.3 shows the die photograph.
The on-chip network dissipates 51mW in FAST mode under the full traffic
condition. Fig. 5.4 shows the power breakdown and the effectiveness of the proposed
techniques. By using the low-power techniques such as low-swing signaling, crossbar
partial activation, and serial-link coding, the overall power consumption is reduced
by 38%. The power efficiency index, power consumption per bandwidth, is 4.6mW/GB/s
in this work, which is only a ninth of that of the previous work [9], 41mW/GB/s.
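The power efficiency index follows directly from the two figures quoted above (51mW network power and 11.2GB/s aggregate bandwidth). As a worked check, with symbol names chosen here only for illustration:

```latex
\[
\frac{P_{\mathrm{NoC}}}{BW_{\mathrm{aggregate}}}
  = \frac{51\ \mathrm{mW}}{11.2\ \mathrm{GB/s}}
  \approx 4.55\ \mathrm{mW/GB/s}
  \approx 4.6\ \mathrm{mW/GB/s},
\qquad
\frac{41\ \mathrm{mW/GB/s}}{4.55\ \mathrm{mW/GB/s}} \approx 9.0
\]
```

The second ratio is where the factor of about nine relative to the previous work comes from.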
Fig. 5.3: Die photograph.
Fig. 5.4: On-chip network power reduction by the proposed low-power techniques.
The implemented chip was measured successfully. Fig. 5.5 shows the measured
packet signals on the network in FAST mode, and Fig. 5.6 shows the measured packets
with and without SILENT coding while a 3D graphics application is running on the
system [22]. The effectiveness of the coding is clearly visible in the smaller number
of transitions on the channel.
[Figure: measured Strobe and Packet<0> waveforms with header, address, data and EOP
fields (bit pattern 1 0 0 0 0 0 1 0 1 0, 1nsec scale); measured by Tektronix TDS7708
(w/ 40Gbps sampling frequency).]
Fig. 5.5: Measured packet signal on the network.
[Figure: measured Strobe and Pkt [0] to Pkt [7] waveforms up to EOP: (a) without SILENT
coding (134 transitions), (b) with SILENT coding (79 transitions).]
Fig. 5.6: Measured packet signals (a) without and (b) with SILENT coding.
5.3 Network-in-Package
Four NoCs are mounted on a single 676-BGA package, as shown in Fig. 5.7(c), (d).
The Network-in-Package (NiP) needs four isolated supply voltages: 1.8V for digital
logic, 1.8V for analog circuits, 3.3V for I/O, and sub-0.5V for the low-swing links.
The operating frequencies are different for each module: 50MHz for the peripheral
logic, 100MHz for the processors, 800MHz for the scheduler, and 1.6GHz for the on-chip
networks.
Fig. 5.7: NiP simulation (TLM, SSN) and NiP photograph.
An important issue in the package design is power integrity, i.e. the design of the
power and ground (P/G) network. No significant resonance should occur on the P/G
plane of the package at the operating frequencies. Otherwise, small noise from signals
or the external system causes such significant P/G noise that the P/G network becomes
unstable. In order to analyze the power integrity, I used the Transmission Line Matrix
(TLM) method and Simultaneous Switching Noise (SSN) analysis (see Fig. 5.7(a), (b)).
I integrated decoupling capacitors for each supply voltage at the proper positions:
5 for logic power, 2 for I/O power, and 1 for analog power. Each capacitor has 10nF
capacitance and 600pH effective series inductance. Fig. 5.7(a) shows the
self-impedance of the power plane. I targeted a self-impedance of 1Ω for the power
plane. The solid line is for the bare P/G plane and the dotted line is for the plane
with the proposed decoupling capacitors inserted. The impedance of the bare plane
shows inductive characteristics and exceeds the target impedance at 800MHz and 1.6GHz.
After the decoupling capacitors are inserted, an impedance resonance occurs at
272.1MHz, which comes from the L of the bare plane and the C of the inserted
capacitors. As a result, the self-impedance at the target frequencies
becomes lower than 1Ω. Fig. 5.7(b) shows the SSN analysis results when 1.6GHz
input signals are applied. The switching noise is reduced from 34mV to 2mV by the
decoupling capacitor insertion. I also inserted ground lines between high-frequency
signal lines to eliminate crosstalk.
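The 272.1MHz resonance can be related to the plane inductance and the inserted capacitance through the standard relation for an ideal LC tank. This is only a first-order illustration: the symbols L_plane and C_dec are introduced here, the text does not state the plane inductance or which capacitors set the resonance, and the series inductance and resistance of the capacitors are ignored.

```latex
\[
f_{\mathrm{res}} = \frac{1}{2\pi\sqrt{L_{\mathrm{plane}}\, C_{\mathrm{dec}}}}
\quad\Longleftrightarrow\quad
L_{\mathrm{plane}} = \frac{1}{(2\pi f_{\mathrm{res}})^{2}\, C_{\mathrm{dec}}}
\]
```

Under this simplification, a larger total decoupling capacitance moves the resonance to a lower frequency, further away from the 800MHz and 1.6GHz operating frequencies.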
5.4 Demonstration System
I developed a demonstration system with the implemented Network-in-Package for
multimedia applications. Fig. 5.9 shows the demonstration system, which consists of
an NiP board on top, a video board on the bottom layer, and an LCD module. The system
shows on the display a JPEG-decoded image processed by the embedded processors
through the networks on the chip and the networks in the package. The high-speed
packets, up to 1.6GHz, are measured via an on-board probing PAD near the package.
Fig. 5.9(d) shows packet transactions between two NoCs on the NiP where the two NoCs
are running at different clock frequencies, e.g. 400MHz and 274MHz.
Fig. 5.8: Demonstration-system for single-chip package.
[Figure panels: (a) evaluation board (mother board) with the NiP, display, instruction
ROM and video card at the bottom layer; (b) high-speed probing PAD; (c) images on the
display: chip layout, designers, and 3D scene; (d) packet and clock signals on the
Network-in-Package.]
Fig. 5.9: Demonstration-system for Network-in-Package.
Chapter 6 Conclusions
A low-power packet-switched Network-on-Chip (NoC) with a hierarchical star
topology is designed and implemented for high-performance SoC design. According
to a performance- and cost-oriented topology analysis, the hierarchical star topology
shows the best cost-efficiency and also the lowest latency.
The chip contains two RISC processors for multiprocessor emulation, two 64kb
SRAMs, an on-chip FPGA, an off-chip gateway for the off-chip network interface, three
4kb SRAMs for peripheral logic emulation, a 1.6GHz PLL for internal clock generation,
and on-chip networks connecting those processing units. The on-chip network channel is
serialized from 80 bits onto 8 bits to reduce the network area significantly.
Source-synchronous signaling enables plesiochronous communication between processing
units running at different clock frequencies. Low power consumption is achieved by
applying various techniques such as low-swing signaling on the global link, a Mux-Tree
based round-robin scheduler in the router, crossbar partial activation, low-energy
serial-link coding, and clock frequency scaling. The chip integrates 2.5 million
transistors and consumes less than 160mW, and the on-chip network consumes less than
51mW while delivering 11.2GB/s aggregate network bandwidth. The 5x5mm2 chip is
fabricated in a 0.18µm CMOS process, and four fabricated chips are integrated in a
single BGA package to form a Network-in-Package (NiP) for large, scalable, low-cost
systems. The implemented NoC and NiP were successfully measured and demonstrated
on a system evaluation board running multimedia applications.
BONE™ 2.0 Specification
Semiconductor System Lab., Dept. of EE, KAIST
Se-Joong Lee, shocktop@eeinfo.kaist.ac.kr
Build Number: 009
© copyright SSL Limited 2003. All rights reserved
Contents
BONE Specification
Chapter 1 Introduction to the BONE
  1.1 Overview of the BONE specification
  1.2 Overall Architecture of the BONE
  1.3 Features
  1.4 Terminology
Chapter 2 BONE signals
  2.1 Master Network Interface (MNI)
  2.2 Up_Sampler (UPS)
  2.3 Switch (SW)
  2.4 Dn_Sampler (DNS)
  2.5 Slave Network Interface (SNI)
Chapter 3 BONE data-plane protocol
  3.1 Packet formation
  3.2 Packet transaction-level protocol
  3.3 Control signals
  3.4 NI Timing diagram (basic packet transfer)
  3.5 NI Timing diagram (burst packet transfer)
  3.6 NI Timing diagram (acknowledge, flow-control)
  3.7 UPS/DNS Timing diagram
  3.8 SW Timing diagram
Chapter 4 BONE control-plane protocol (Under construction)
Chapter 5 BONE peripherals (Under construction)
Chapter 1
Introduction to the BONE
1.1 Overview of the BONE
The Basic On-chip Network (BONE) specification defines an on-chip
communication architecture and its protocol standards for designing
high-performance, application-specific, very-large-scale systems-on-chip (SoCs).
The BONE is built on a network architecture. It interconnects the integrated
functional units, or intellectual property blocks (IPs), and provides sufficient
bandwidth and minimum latency without requiring global synchronization.
1.2 Overall Architecture
The BONE consists of five kinds of components: Master Network Interface
(MNI), Slave Network Interface (SNI), Up_Sampler (UPS), Dn_Sampler
(DNS), and Switch (SW). The MNI connects a master to the BONE, and the SNI
connects a slave. Using the UPS and DNS, the BONE serializes and deserializes
packets. The non-blocking SW routes packets. A path from the MNI to the SNI is
called forward_path, and the reverse path is called backward_path.
Figure 1.1.1 Overall architecture (forward_path: Master, MNI, UPS, SW, DNS, SNI, Slave; backward_path: Slave, SNI, UPS, SW, DNS, MNI, Master)
1.3 Features
1.3.1 Physical Layer
* 32b ADDR, 32b DATA, and 8b SIG
* 7b routing information (RI) for packet routing
* fCLK_MASTER < fCLK_BONE
* fCLK_SLAVE < fCLK_BONE
* No synchronization between different clocks
1.3.2 Datalink Layer
* Serialization / Deserialization
* 2-level flow control
1.3.3 Network Layer
* 7b source routing
* Control Packet for managing the network
* Cut-through switching
1.3.4 Transport Layer
* Encoding / Decoding scheme to reduce packet size
* Acknowledge Packet for end-to-end hand-shaking
* Compact Packet for burst read/write operation
* Direct Signal (SIG) field for direct-mapped signal transfer
* Four burst lengths are available: 1, 2, 4, and 8
1.4 Terminology
The following terms are used throughout this specification.
phase_1, phase_2 The front and back half periods of a clock cycle.
forward_packet A packet transferred through forward_path.
backward_packet A packet transferred through backward_path.
burst_period The clock cycles in which burst packets are transmitted.
burst_start_period The clock cycle at which burst_period starts.
burst_ext_period The clock cycles in which the burst operation continues
after burst_start_period.
burst_read A read command that expects burst read-out data.
burst_write A series of write operations whose address space is
contiguous.
Figure 1.4.1 Terminologies about burst operation (a burst D0 to D3 on FD<31:0> with FDOEN, marking burst_start_period, burst_ext_period, burst_period, phase_1, and phase_2)
Chapter 2
BONE signals
2.1 Master Network Interface (MNI)
Figure 2.1 MNI interface diagram
WTMSREQ Wait Master Request.
WTMS Wait Master. Bypassing WTMSREQ. Controls packet flow from a
master
WTDNSREQ Wait Down_Sampler Request.
WTDNS Wait Down_Sampler. Bypassing WTDNSREQ. Controls packet flow
from DNS.
FWRITE Forward path HIGH for write command, LOW for read command
FBL<1:0> Forward path Burst Length (00:1, 01: 2, 10:4, 11:8)
FPRT Forward path Priority (1: High priority 0: Low priority)
FSEN Forward path Direct Signal Enable.
FS<7:0> Forward path Direct Signals
FAEN Forward path Address Enable
FA<31:0> Forward path Address
FDEN Forward path Data Enable
FD<31:0> Forward path Data
FACKREQ Forward path Acknowledge Request. An NI that receives a packet
with ACKREQ set must reply with an ACK packet.
FHOEN Forward path Header Output Enable.
FHO<23:0> Forward path Header Output
FAOEN Forward path Address Output Enable
FAO<31:0> Forward path Address Output
FDOEN Forward path Data Output Enable
FDO<31:0> Forward path Data Output
BHEN Backward path Header Enable
BH<23:0> Backward path Header
BAEN Backward path Address Enable
BA<31:0> Backward path Address
BDEN Backward path Data Enable
BD<31:0> Backward path Data
BACKn Backward path Acknowledge Not. Set to ‘H’ when ACKREQ is
asserted. Set to ‘L’ when an ACK packet arrives.
BSOEN Backward path Direct Signal Output Enable
BSO<7:0> Backward path Direct Signal Output
BDOEN Backward path Data Output Enable
BDO<31:0> Backward path Data Output
Table 2.1 MNI signals
2.2 Up_Sampler (UPS)
Figure 2.2 UPS interface diagram
WTNIREQ Wait Network Interface Request.
WTNI Wait Network Interface. Retiming WTNIREQ.
SRCHEN Source Header Enable. Source may be one of MNI or SNI.
SRCH<23:0> Source Header
SRCAEN Source Address Enable.
SRCA<31:0> Source Address
SRCDEN Source Data Enable
SRCD<31:0> Source Data
DATASTB Data strobe. The destination retimes packets at its rising edge.
EOP End-of-packet.
LINK<7:0> Link through which packets are delivered.
Table 2.2 UPS signals
2.3 Switch (SW)
Figure 2.3 SW interface diagram
WTSWREQx Wait Switch Request. Switch must slow down or even stop
packet transfer. ‘x’ indicates port no.
WTUPSx Wait UPS. Switch buffer is close to overflow.
DATASTBJx Datastb Input. Switched w/o retiming.
EOPJx End-of-packet Input. Switched w/o retiming.
LINKJx<7:0> Link Input.
DATASTBOx Datastb Output
EOPOx End-of-packet Output
LINKOx<7:0> Link Output
Table 2.3 SW signals
2.4 Dn_Sampler (DNS)
Figure 2.4 DNS interface diagram
WTSWREQ Wait Switch Request
WTSW Wait Switch. Retiming WTSWREQ.
DSTHOEN Destination Header Output Enable.
DSTHO<23:0> Destination Header Output.
DSTAOEN Destination Address Output Enable.
DSTAO<31:0> Destination Address Output
DSTDOEN Destination Data Output Enable
DSTDO<31:0> Destination Data Output
Table 2.4 DNS signals
2.5 Slave Network Interface (SNI)
Figure 2.5 SNI interface diagram
WTDNSREQ Wait DNS Request. A slave can apply backpressure with this
signal.
WTDNS Wait DNS. Bypassing WTDNSREQ.
WTSLREQ Wait Slave Request. A switch controls packet flow from a
slave.
WTSL Wait Slave. Bypassing WTSLREQ.
FHEN Forward path Header Enable
FH<23:0> Forward path Header.
FAEN Forward path Address Enable
FA<31:0> Forward path Address
FDEN Forward path Data Enable
FD<31:0> Forward path Data
FWRITE Forward path ‘H’ for write operation. ‘L’ for read operation
FBLO<1:0> Forward path Burst Length Output
FTOEN Forward path Tag Output Enable
FTO<15:0> Forward path Tag Output. This Tag Information is used for
backward_packet from slave to master.
FSOEN Forward path Direct Signal Output Enable
FSO<7:0> Forward path Direct Signal Output
FAOEN Forward path Address Output Enable
FAO<31:0> Forward path Address Output
FDOEN Forward path Data Output Enable
FDO<31:0> Forward path Data Output
BSEN Backward path Direct Signal Enable
BDEN Backward path Data Enable
BS<7:0> Backward path Direct Signals.
BD<31:0> Backward path Data.
BHOEN Backward path Header Output Enable
BHO<23:0> Backward path Header Output
BDOEN Backward path Data Output Enable
BDO<31:0> Backward path Data Output
Table 2.5 SNI signals
Chapter 3
BONE data-plane protocol
3.1 Packet formation
3.1.1 Packet format
Figure 3.1.1 Packet formats. Compact Packet: C_HEADER<7:0> (RI, C=1) followed by DATA. Acknowledge Packet: TAG only (RI, C=0, P, W=1, SAD=000, AC=0, BL=00). Normal Packet: HEADER<23:0>, made of TAG<15:0> (RI, C=0, P, W, SAD, AC, BL) and SIG, followed by ADDR<31:0> and DATA<31:0>.
C Compact. ‘1’ for compact packet, ‘0’ for others.
RI Route Information.
P Priority. ‘1’ for high priority
W Write
SAD 3b bitmap encoding. Each bit indicates whether the
corresponding field exists or not. For example, 101 means that
the Direct Signal and Data fields exist.
AC Acknowledge Request.
BL Burst Length
HC Hop Count. Every switch performs shift-right. Control
packet is switched according to RI until HC becomes 0.
Table 3.1.1 TAG field description
The HEADER includes the TAG and the Direct Signal (SIG) field. In a
Compact Packet, the header is only 1 byte and is called C_HEADER. The
header of a Control Packet contains a Hop Count (HC) field instead of the
SIG field.
Acknowledge Packet consists of TAG.
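The TAG fields of Table 3.1.1 can be illustrated with a small packing helper. This is a minimal sketch, not part of the specification: the bit positions inside TAG<15:0> (C in bit 15, RI in bits 14:8, then P, W, SAD, AC, and BL) are an assumption made for illustration, as are the names `Tag`, `pack_tag`, `unpack_tag`, `fields_present`, and `is_acknowledge`. Only the field widths, the S/A/D bitmap order, and the Acknowledge Packet pattern (W=1, SAD=000, AC=0, BL=00) are taken from the text and Figure 3.1.1.

```python
from dataclasses import dataclass

# Assumed TAG<15:0> layout (hypothetical, chosen for illustration only):
#   [15] C, [14:8] RI, [7] P, [6] W, [5:3] SAD, [2] AC, [1:0] BL


@dataclass
class Tag:
    c: int    # Compact flag: 1 for a Compact Packet, 0 otherwise
    ri: int   # 7b route information
    p: int    # Priority: 1 for high priority
    w: int    # Write
    sad: int  # 3b bitmap, MSB first: S (SIG), A (ADDR), D (DATA)
    ac: int   # Acknowledge request
    bl: int   # 2b burst length code


def pack_tag(t: Tag) -> int:
    """Pack the fields into a 16-bit TAG word using the assumed layout."""
    return ((t.c & 1) << 15 | (t.ri & 0x7F) << 8 | (t.p & 1) << 7 |
            (t.w & 1) << 6 | (t.sad & 7) << 3 | (t.ac & 1) << 2 | (t.bl & 3))


def unpack_tag(word: int) -> Tag:
    """Inverse of pack_tag."""
    return Tag(c=(word >> 15) & 1, ri=(word >> 8) & 0x7F, p=(word >> 7) & 1,
               w=(word >> 6) & 1, sad=(word >> 3) & 7, ac=(word >> 2) & 1,
               bl=word & 3)


def fields_present(sad: int) -> dict:
    """Decode the SAD bitmap: which optional fields follow the TAG."""
    return {"SIG": bool(sad & 4), "ADDR": bool(sad & 2), "DATA": bool(sad & 1)}


def is_acknowledge(t: Tag) -> bool:
    """Acknowledge Packet pattern from Figure 3.1.1: W=1 with no fields."""
    return t.c == 0 and t.w == 1 and t.sad == 0 and t.ac == 0 and t.bl == 0
```

With this layout, `fields_present(0b101)` reports that SIG and DATA are present and ADDR is absent, which matches the 101 example in Table 3.1.1.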
3.1.2 Route information field (RI)
The 7b RI contains route information from a master to a slave. The RI
field is divided into a few sub-fields, each of which corresponds to the
output port at one switch hop. To record the route path, each switch
modifies the RI: it deletes the head index and attaches the flipped source
port index. The RI for the reverse path can then be obtained by a bit-wise
flip operation. RI modification and flipping for the reverse-path RI are
depicted in Figure 3.1.2.
(Figure content: a 7b RI with three sub-fields, J1<1:0>, J2<2:0>, and J3<1:0>, which index the 1st, 2nd, and 3rd hop. Each SW performs RI modification, replacing its index Jn with J'n. The SNI performs RI flipping, turning the RI for Master -> Slave into the RI for Slave -> Master, J'3 J'2 J'1.)
Figure 3.1.2 RI modification and flipping
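The RI handling described above can be sketched in a few lines. This is one possible reading under stated assumptions, not the specified implementation: a "flipped" source port index is taken to mean the index with its bits reversed, and the "bit-wise flip" at the SNI is taken to mean reversing the whole 7b word. Under that reading the reverse-path RI comes out with every sub-field in normal bit order. The sub-field widths (2, 3, 2) follow Figure 3.1.2; the port numbers and all function names are hypothetical.

```python
def to_bits(value, width):
    """MSB-first list of bits for an unsigned port index."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def make_ri(out_ports, widths):
    """Concatenate the per-hop output port indices into one RI word."""
    ri = []
    for port, width in zip(out_ports, widths):
        ri += to_bits(port, width)
    return ri


def switch_hop(ri, width, src_port):
    """RI modification at one switch (assumed reading of the text).

    The head index selects the output port and is deleted; the source port
    index is attached at the tail with its bits reversed.
    """
    out_port = from_bits(ri[:width])
    return out_port, ri[width:] + to_bits(src_port, width)[::-1]


def flip_ri(ri):
    """RI flipping at the SNI, assumed to reverse the whole 7b word."""
    return ri[::-1]


def traverse(ri, widths, src_ports):
    """Pass the RI through a chain of switches; return ports taken and final RI."""
    taken = []
    for width, src in zip(widths, src_ports):
        out, ri = switch_hop(ri, width, src)
        taken.append(out)
    return taken, ri


# Hypothetical three-hop route: index widths 2b, 3b, 2b.
widths = [2, 3, 2]
fwd_out = [2, 5, 1]   # output port chosen at hop 1, 2, 3
fwd_src = [1, 6, 3]   # port on which the packet entered hop 1, 2, 3

taken, ri_at_sni = traverse(make_ri(fwd_out, widths), widths, fwd_src)
# On the way back the switches are visited in reverse order, and the packet
# enters each switch on the port it left through on the forward path.
back_taken, _ = traverse(flip_ri(ri_at_sni), widths[::-1], fwd_out[::-1])
```

In this sketch the backward packet leaves each switch through the port on which the forward packet entered it, so the slave needs no routing table to reach the master.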
3.2 Packet transaction-level protocol
Figure 3.2.1 Basic read packet transaction. The master sends TAG, SIG, and ADDR (C=0, W=0, AC=0, BL=00, A=1, D=0). The slave returns TAG, SIG, and DATA on the reverse-path RI (C=0, W=0, AC=0, BL=00, A=0, D=1).
Figure 3.2.2 Basic write packet transaction. The master sends TAG, SIG, ADDR, and DATA (C=0, W=1, AC=0, BL=00, A=1, D=1).
Figure 3.2.3 Low priority burst read packet transaction. The master sends TAG, SIG, and ADDR (C=0, P=0, W=0, AC=0, A=1, D=0). The slave returns one Normal Packet with TAG, SIG, and DATA (C=0, P=0, W=0, AC=0, A=0, D=1), followed by Compact Packets, each made of a C_HEADER (C=1) and DATA, all on the reverse-path RI.
Figure 3.2.4 Low priority burst write packet transaction. The master sends one Normal Packet with TAG, SIG, ADDR, and DATA (C=0, P=0, W=1, AC=0), followed by Compact Packets, each made of a C_HEADER (C=1) and DATA.
Figure 3.2.5 High priority burst read packet transaction. The master sends TAG, SIG, and ADDR (C=0, P=1, W=0, AC=0, A=1, D=0). The slave returns a series of Normal Packets, each with TAG, SIG, and DATA (C=0, P=1, W=0, AC=0, A=0, D=1), on the reverse-path RI.
Figure 3.2.6 High priority burst write packet transaction. The master sends TAG, SIG, ADDR, and DATA (C=0, P=1, W=1, AC=0, A=1, D=1), followed by Normal Packets, each with TAG, SIG, and DATA.
Figure 3.2.7 Low/high priority, single/burst write w/ ackreq packet transaction. The master sends TAG, SIG, ADDR, and DATA (C=0, W=1, AC=1, A=1, D=1). The slave replies with an Acknowledge Packet (AP: C=0, W=1, SAD=000, AC=0, BL=00) on the reverse-path RI. In the case of burst write, only the first packet can request acknowledgement.
Figure 3.2.8 Direct-signal packet transaction. Either a master or a slave sends TAG and SIG (C=0, W=1, S=1, A=0, D=0, AC=0, BL=00).
3.3 Control signals
3.3.1 Serialization
The UPS serializes packets to transfer them through the 8b LINK. To
indicate the packet length, the EOP (end-of-packet) signal is transmitted
together with the packet, as shown in Figure 3.7.1.
3.3.2 Flow control
Masters, switches, and slaves can generate ‘WT???’ (wait) signals to control
the flow rate. The MNI, UPS, DNS, and SNI only bypass or retime the flow
control signals.
3.4 NI Timing diagram (basic packet transfer)
3.4.1 Basic read packet transaction
Figure 3.4.1 (Master) MNI (UPS) protocol for basic read
The outputs of a master are latched outputs, settled in phase_1.
Forward_path latency is zero [MNI_01].
Before it outputs BS<7:0> and BD<31:0>, the MNI examines whether a
packet is destined to the MNI itself (i.e. a Control Packet) or to the master.
If the packet is destined to the MNI, BSOEN and BDOEN are disabled [MNI_02].
Backward_path latency is zero [MNI_03].
Figure 3.4.2 (DNS) SNI (Slave) protocol for basic read
The SNI does not retime forward_path inputs, and thus transfers them as
soon as possible. In other words, forward_path latency is zero [SNI_01].
The SNI examines whether a packet is a Control Packet or not. If it is,
FSOEN, FAOEN, and FDOEN are disabled. Then, the SNI interprets the Control
Packet [SNI_02].
For a read operation, the SNI modifies FH to generate FTO, which is used for
the relevant backward_packet. The C, P, and BL fields are not changed. RI is
flipped. W is set to ‘H’. SAD is set properly. AC is set to ‘L’. The FTO must be
delayed by the amount of the slave latency so that it is output through BT
together with BD.
[SNI_03].
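Rule [SNI_03] can be written as a small field-level transformation. The sketch below is illustrative only and works on named fields rather than on a fixed bit layout. Two points are assumptions rather than statements of the specification: RI flipping is modeled as reversing the 7 bits of the RI, and "SAD is set properly" is read as D=1 for the read data, A=0, and S=1 only when the slave also returns Direct Signals. The names `flip_ri`, `make_backward_tag`, and `has_signal` are hypothetical.

```python
def flip_ri(ri, width=7):
    """Reverse the bit order of the RI (assumed meaning of 'flipped')."""
    flipped = 0
    for i in range(width):
        flipped = (flipped << 1) | ((ri >> i) & 1)
    return flipped


def make_backward_tag(fwd, has_signal=False):
    """Derive the backward TAG (FTO) from the forward TAG fields for a read.

    fwd is a dict with keys C, RI, P, W, SAD, AC, BL.
    """
    return {
        "C": fwd["C"],               # unchanged
        "P": fwd["P"],               # unchanged
        "BL": fwd["BL"],             # unchanged
        "RI": flip_ri(fwd["RI"]),    # flipped for the reverse path
        "W": 1,                      # set to 'H'
        "SAD": (0b100 if has_signal else 0) | 0b001,  # assumed: DATA, optional SIG
        "AC": 0,                     # set to 'L'
    }


# Example: a basic read request whose RI is 1100000.
request = {"C": 0, "RI": 0b1100000, "P": 0, "W": 0, "SAD": 0b010, "AC": 0, "BL": 0}
reply_tag = make_backward_tag(request)
```

The delay of FTO by the slave latency is a timing matter and is not modeled here.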
Backward_path inputs may be settled in phase_2, thus backward_path
latency is 1 [SNI_04].
3.4.2 Basic write packet transaction
Figure 3.4.3 (Master) MNI (UPS) protocol for basic write
Figure 3.4.4 (DNS) SNI (Slave) protocol for basic write
FTOEN is disabled for write operation.
3.5 NI Timing diagram (burst packet transfer)
3.5.1 Burst read packet transaction (low priority)
Figure 3.5.1 (Master) MNI (UPS) protocol for burst read w/ low priority
All signals are valid during burst_ext_period [MNI_04].
BSOEN is disabled during burst_ext_period [MNI_05].
Figure 3.5.2 (DNS) SNI (Slave) protocol for burst read w/ low priority
SNI supports burst_read operation. BHO outputs C_HEADER if
priority of BT0 is ‘L’ [SNI_05].
3.5.2 Burst write packet transaction (low priority)
Figure 3.5.3 (Master) MNI (UPS) protocol for burst write w/ low priority
(FH0, FA0, FD0) constitutes a Normal Packet. (FH1, FD1), (FH1,
FD2), (FH1, FD3) constitute Compact Packets when FPRT is ‘L’
[MNI_06].
FSEN and FAEN are disabled during burst_ext_period [MNI_07]
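The effect of Compact Packets on link traffic can be estimated with simple arithmetic. The sketch below is an illustration under assumptions, not a figure from the specification: it counts bytes on the 8b LINK for a burst write, taking 3 bytes for HEADER<23:0>, 1 byte for C_HEADER<7:0>, and 4 bytes each for ADDR and DATA, with the first packet always a Normal Packet carrying ADDR and DATA. The follow-on packets are Compact Packets for low priority and header-plus-data Normal Packets for high priority. The function name and constants are hypothetical.

```python
HEADER_BYTES = 3     # HEADER<23:0>
C_HEADER_BYTES = 1   # C_HEADER<7:0>
ADDR_BYTES = 4       # ADDR<31:0>
DATA_BYTES = 4       # DATA<31:0>

# FBL<1:0> encoding from Table 2.1: 00 -> 1, 01 -> 2, 10 -> 4, 11 -> 8
BURST_LENGTHS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


def burst_write_link_bytes(fbl, high_priority):
    """Bytes sent over the 8b LINK for one burst write (assumed packet sizes)."""
    beats = BURST_LENGTHS[fbl]
    first = HEADER_BYTES + ADDR_BYTES + DATA_BYTES
    if high_priority:
        follow = HEADER_BYTES + DATA_BYTES      # Normal Packet without ADDR
    else:
        follow = C_HEADER_BYTES + DATA_BYTES    # Compact Packet
    return first + (beats - 1) * follow


low = burst_write_link_bytes(0b10, high_priority=False)   # 4-beat, low priority
high = burst_write_link_bytes(0b10, high_priority=True)   # 4-beat, high priority
```

Under these assumptions a 4-beat low-priority burst takes 26 link bytes against 32 for the high-priority case, which illustrates why the Compact Packet is listed as a way to reduce packet size.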
Figure 3.5.4 (DNS) SNI (Slave) protocol for burst write w/ low priority
FTOEN is disabled [SNI_06]
3.5.3 Burst read packet transaction (high priority)
Figure 3.5.5 (Master) MNI (UPS) protocol for burst read w/ high priority
BSOEN may be asserted during burst_ext_period to receive
multiple Direct Signals [MNI_08].
(BH, BD) constitutes a Normal Packet [MNI_09].
Figure 3.5.6 (DNS) SNI (Slave) protocol for burst read w/ high priority
BSEN may be asserted for multiple cycles to transfer multiple
Direct Signals [SNI_07].
3.5.4 Burst write packet transaction (high priority)
Figure 3.5.7 (Master) MNI (UPS) protocol for burst write w/ high priority
FSEN may be asserted during burst_ext_period to transfer multiple
Direct Signals [MNI_10].
(FHO, FAO, FDO) constitutes a Normal Packet [MNI_11].
Figure 3.5.8 (DNS) SNI (Slave) protocol for burst write w/ high priority
FHEN is asserted for burst_ext_period [SNI_08].
3.6 NI Timing diagram (acknowledge, flow-control)
3.6.1 Acknowledge packet transaction
Figure 3.6.1 Acknowledge request by MNI
BACKn must be asserted if FACKREQ is enabled. It is deasserted
when BH includes an Acknowledge Packet [MNI_12].
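The behavior of BACKn in [MNI_12] amounts to a one-bit set/clear state. The sketch below is a cycle-level illustration, not RTL from the specification. The class name `BacknTracker` and its `step` method are hypothetical, and giving a new FACKREQ priority over an arriving acknowledge in the same cycle is an assumption.

```python
class BacknTracker:
    """Tracks BACKn: high while an acknowledge is outstanding."""

    def __init__(self):
        self.backn = 0

    def step(self, fackreq, ack_arrived):
        """Advance one clock cycle and return the new BACKn level."""
        if fackreq:
            # An acknowledge was requested: BACKn is asserted.
            self.backn = 1
        elif ack_arrived:
            # BH carries an Acknowledge Packet: BACKn is deasserted.
            self.backn = 0
        return self.backn


tracker = BacknTracker()
trace = [
    tracker.step(fackreq=False, ack_arrived=False),  # idle
    tracker.step(fackreq=True, ack_arrived=False),   # write with ACKREQ issued
    tracker.step(fackreq=False, ack_arrived=False),  # waiting for the reply
    tracker.step(fackreq=False, ack_arrived=True),   # Acknowledge Packet arrives
]
```

A master can poll BACKn in this way to learn when its acknowledged write has completed end to end.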
Figure 3.6.2 Acknowledge reply by SNI
If the acknowledge request field of FH0 is enabled, the SNI replies with an
Acknowledge Packet in the next clock cycle [SNI_09].
3.6.2 Flow control
Figure 3.6.3 Flow-controlled burst write at MNI
Figure 3.6.4 Flow-controlled burst read at MNI
Burst forward_packet of an MNI is flow-controlled by disabling
FDEN and FSEN [MNI_13].
Burst backward_packet of an MNI is flow-controlled by disabling
BHEN and BDEN [MNI_14].
Figure 3.6.5 Flow-controlled burst write at SNI
Figure 3.6.6 Flow-controlled burst read at SNI
Burst forward_packet of an SNI is flow-controlled by disabling
FHEN and FDEN [SNI_10].
Burst backward_packet of an SNI is flow-controlled by disabling
BSEN and BDEN [SNI_11].
3.7 UPS/DNS Timing diagram
When a UPS serializes a packet, it sends the fields in the order Header,
Address, and Data. The operation of the UPS and DNS does not depend on the
packet transfer type, such as read/write, basic/burst, or acknowledgement.
EOP signal is used to indicate the end of the packet.
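The serialization order can be illustrated with a short function that turns one packet into a sequence of link bytes. It is a sketch under assumptions rather than the UPS design: the order Header, Address, Data and the field widths come from the specification, while sending each field most significant byte first and raising EOP on the last byte of the packet are choices made here for illustration. The function name `serialize` and its parameters are hypothetical.

```python
def serialize(header, addr=None, data=None, compact=False):
    """Return a list of (link_byte, eop) pairs for one packet.

    header  : 24-bit HEADER, or 8-bit C_HEADER when compact is True
    addr    : optional 32-bit ADDR (present when SRCAEN would be asserted)
    data    : optional 32-bit DATA (present when SRCDEN would be asserted)
    """
    header_len = 1 if compact else 3
    link_bytes = list(header.to_bytes(header_len, "big"))
    if addr is not None:
        link_bytes += list(addr.to_bytes(4, "big"))
    if data is not None:
        link_bytes += list(data.to_bytes(4, "big"))
    last = len(link_bytes) - 1
    # EOP is assumed to be high only on the final byte of the packet.
    return [(b, i == last) for i, b in enumerate(link_bytes)]


# Normal Packet with address and data: 3 + 4 + 4 = 11 link bytes (L0 to L10).
normal = serialize(0x55EA00, addr=0x00001000, data=0xDEADBEEF)
# Compact Packet: 1 + 4 = 5 link bytes (L0 to L4).
compact = serialize(0x80, data=0x12345678, compact=True)
```

A deserializer on the DNS side can rebuild the fields by collecting bytes until EOP and using the header to decide which fields are present.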
Figure 3.7.1 (MNI) UPS (SW). Three cases are shown on the 8b link with DATASTB and EOP: (1) header and address, H H H A A A A (L0 to L6); (2) header, address, and data, H H H A A A A D D D D (L0 to L10); (3) compact header and data, H D D D D (L0 to L4). tUPSd marks the delay before serialization starts.
The minimum delay time, tUPSd, which is required by the UPS to start
serialization, is open for actual design. This specification does not
fix the value.
Figure 3.7.2 (SW) DNS (SNI). A packet H H H A A A A D D D D (L0 to L10) is de-serialized into DSTHO, DSTAO, and DSTDO. tDNSd marks the delay to finish de-serialization.
The minimum delay time, tDNSd, which is required by DNS to
finish de-serialization, is open for actual design. This specification
does not fix the value.
3.8 SW Timing diagram
Figure 3.8.1 Switch delay timing diagram (the link, DATASTB, and EOP inputs appear at the outputs after tSWd)
The switching delay time, tSWd, is open for actual design. This
specification does not fix the value.
Figure 3.8.2 Switch timing for flow-control (WTSWREQx with tWT_SW_R and tWT_SW_F; WTUPSx with tWT_UPS_R and tWT_UPS_F)
The minimum setup time to request wait of a switch, tWT_SW_R,
is 2 clock cycles [SW_01].
The minimum delay time from de-assertion of WTSWREQ to
LINKO, tWT_SW_F, is 1 clock cycle [SW_02].
The minimum setup time of WTUPS to request wait of a UPS
before the end of the previous packet is open for actual design. It may
depend on the propagation delay and response time of the UPS.
The minimum delay time from de-assertion of WTUPS to LINKJ is
1 clock cycle [SW_03].
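Korean Abstract
A packet-switched Network-on-Chip (NoC) was designed for low power for
high-performance SoCs and fabricated in silicon. This work covers the whole NoC
design method, from the decision on the NoC architecture to the system demonstration.
First, performance, power, and area were compared and analyzed to decide the
topology. The comparison covers not only basic topologies such as bus, mesh, star,
and point-to-point, but also hierarchical and heterogeneous topologies composed of
these basic topologies.
Second, for the NoC architecture and its building blocks, the switching method,
packet synchronization, link serialization, protocol, and buffering method were analyzed.
The fabricated chip integrates two RISC processors for multiprocessor emulation,
two 64kbit SRAMs, an on-chip FPGA, an off-chip gateway for connection to an
off-chip network, three 4kbit SRAMs for emulating peripheral logic, a 1.6GHz PLL,
and an on-chip network as the means of communication among them. The channels
of this on-chip network are serialized from 80 bits to 8 bits to greatly reduce
on-chip area and complexity. Source-synchronous signaling is used for
plesiochronous communication among the on-chip units, which operate at different
clock frequencies. This dissertation proposes and applies the following low-power
techniques for the on-chip network: low-swing signaling on the channels, a Mux-Tree
based round-robin scheduler, crossbar partial activation, low-energy channel coding
for serial links, and operating frequency scaling. The chip consumes at most 160mW,
and the proposed on-chip network consumes 51mW while providing 11.2GB/s of
communication bandwidth. Fabricated in a 0.18µm CMOS process with an area of
25mm², the chip has been verified to operate correctly and demonstrates multimedia
applications. Finally, a Network-in-Package (NiP) technique, which integrates four
fabricated chips in a single package so that larger systems can be built, was
proposed, fabricated, and measured.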
Bibliography
[1] International Technology Roadmap for Semiconductors, http://public.itrs.net
[2] D. Bertozzi et al., “Xpipes: A Network-on-Chip Architecture for Gigascale
System-on-Chip,” IEEE Circuits and Systems Magazine, vol. 4, issue 2, pp.18-31,
2004.
[3] L. Benini et al., “Networks on Chips: A New SoC Paradigm,” Computer, vol. 36,
pp. 70 – 78, Jan. 2002.
[4] S. Kumar, et al., “A Network on Chip Architecture and Design Methodology,” in
Proc. IEEE Computer Society Annual Symposium on VLSI, Apr. 2002, pp. 105-112.
[5] S. Murali, et al., “SUNMAP: A Tool for Automatic Topology Selection and
Generation for NoCs,” in Proc. Design and Automation Conf., June 2004, pp 914-919.
[6] E. Rijpkema, et al., “Trade Offs in the Design of a Router with Both Guaranteed
and Best-Effort Services for Networks on Chip,” in Proc. Design, Automation and Test
Conf. March 2003, pp. 350-355.
[7] F. Worm, et al., “An Adaptive Low-Power Transmission Scheme for On-chip
Networks,” in Proc. Int. Symposium on System Synthesis, Oct. 2002, pp. 92-100.
[8] V. Nollet, et al., “Operating-System Controlled Network on Chip,” in Proc. Design
and Automation Conf., June 2004, pp 256-259.
[9] S.-J. Lee, et al., “An 800MHz Star-Connected On-Chip Network for Application to
Systems on a Chip,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb.
2003, pp. 468-469.
[10] M. Taylor, et al., “A 16-Issue Multiple-Program-Counter Microprocessor with
Point-to-Point Scalar Operand Network,” in IEEE Int. Solid-State Circuits Conf. Dig.
Tech. Papers, Feb. 2003, pp. 170-171.
[11] K. Lee, et al., “A 51mW 1.6GHz On-Chip Network for Low-Power
Heterogeneous SoC Platform,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech.
Papers, Feb. 2004, pp. 152-153.
[12] W. Dally, et al., “Route Packets, Not Wires: On-Chip Interconnection Networks,”
in Proc. Design and Automation Conf., June 2001, pp 684-689.
[13] S. Kimura, et al., “An On-Chip High Speed Serial Communication Method Based
on Independent Ring Oscillators,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech.
Papers, Feb. 2003, pp.390-391
[14] R. Ho. et al., “Efficient On-Chip Global Interconnects,” in Symp. VLSI Circuits
Dig. Tech. Papers, June 2003, pp. 271-274.
[15] C. Svensson, “Optimum Voltage Swing on On-Chip and Off-Chip interconnect,”
IEEE J. of Solid-State Circuits, vol. 36, pp. 1108 - 1112, July 2001.
[16] S. Shahrier, et al., “A Fast Round Robin Priority Port Scheduler for High
Capacity,” in Proc. IEEE International Conference on ATM, April 2001, pp. 173-180.
[17] P. Gupta, et al., “Designing and Implementing a Fast Crossbar Scheduler,” IEEE
Micro, vol. 19, pp. 20-28, Jan. 1999.
[18] K. Lee, et al., “A Variable Round-Robin Arbiter for High Speed Buses and
Statistical Multiplexes,” in Proc. Int. Phoenix Conference on Computers and
Communications, March 1991, pp 23-29.
[19] E. Shin, et al., “Round-robin Arbiter Design and Generation,” in Proc. IEEE Int.
Symp. System Synthesis, pp 243-248, Oct. 2002.
[20] P. Landman, et al., "Architectural Power Analysis: The Dual Bit Type Method,"
IEEE Trans. VLSI Syst., vol.3, pp. 173-187, June 1995.
[21] K. Lee, et al., “SILENT: Serialized Low-Energy Transmission Coding for On-
Chip Interconnection Networks,” in IEEE Int. Conf. Computer Aided Design Dig.
Tech. Papers, Nov. 2004, pp. 448-451.
[22] R. Woo, et al., “A 210-mW Graphics LSI Implementing Full 3-D Pipeline With
264 Mtexels/s Texturing for Mobile Multimedia Applications,” IEEE J. of Solid-State
Circuits, vol. 39, pp. 358 - 367, Feb. 2004.
[23] J.-S. Kim, et al., “On-Chip Network based Embedded Core Testing,” in Proc.
IEEE Int. SOC Conf., pp. 223-226, Sept. 2004.
[24] KAIST Network-on-Chip working group http://ssl.kaist.ac.kr/ocn
[25] D. Geer, “Chip Makers Turn to Multicore Processors,” IEEE Computer, vol. 38,
issue 5, pp.11-13, May 2005.
[26] D. Pham, et al., “The Design and Implementation of a First-Generation CELL
processor,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, 2005, pp. 184-
185.
[27] S. Torii, et al., “A 600MIPS 120mW 70µA Leakage Triple-CPU Mobile
Application Processor Chip,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers,
2005, pp. 136-137.
[28] P. Guerrier, et al., “A Generic Architecture for On-Chip Packet-Switched
Interconnections,” In Proc. Conf. on Design Automation and Test in Europe, 2000, pp.
250-256.
[29] H. Wang, et al., “A Technology-aware and Energy-oriented Topology Exploration
for On-chip Networks,” In Proc. Conf. on Design Automation and Test in Europe,
2005, pp. 1238-1243.
[30] M. Kreutz, et al., “Energy and Latency Evaluation of NoC Topologies,” In Proc.
Int. Symp. on Circuits and Systems, 2005, pp. 5866-5869.
[31] V. George, et al., “The Design of a Low Energy FPGA,” In Proc. Int. Symp. on
Low-Power Electronics and Design, 1999, pp.188-193.
[32] H. Zhang, et al., “Low-Swing On-Chip Signaling Techniques: Effectiveness and
Robustness,” IEEE Trans. VLSI systems, vol. 8, pp. 264-272, June 2000.
[33] AMBA™ Specification, Rev. 2.0, 1999, www.arm.com
[34] M. Karol, et al., “Input Versus Output Queueing on a Space-Division Packet
Switch,” in IEEE Transactions on Communications, vol. 35, no. 12, pp. 1347-1356,
December 1987.
[35] AMBA AXI Protocol Specification, Rev. 0.0, 2003, www.arm.com
[36] J. Duato, et al., “Interconnection Networks, an engineering approach,” Morgan
Kaufmann 2003.
[37] J. Hennessy and D. Patterson, “Computer Architecture: A Quantitative Approach,
3rd edition,” Morgan Kaufmann, p.489.
[38] F. Angiolini, et al., “Contrasting a NoC and a Traditional Interconnect Fabric with
Layout Awareness,” In Proc. Conf. on Design Automation and Test in Europe, 2006.
[39] Y. Moisiadis, et al., “High Performance Level Restoration Circuits for Low-
Power Reduced-swing Interconnection Schemes, ” in Proc. of Int. Conf. on
Electronics Circuits and Systems, Dec. 2000, pp.619-622.
[40] R. Golshan, et al., “A novel reduced swing CMOS BUS interface circuit for high
speed low power VLSI systems,” in Proc. of IEEE Int. Symp. Circuits and Systems,
May 1994, pp.351-354.
[41] Y. Nakagome, et al., “Sub-1-V Swing Internal Bus Architecture for Future Low-
Power ULSI’s,” IEEE J. of Solid-State Circuits, vol. 28, pp. 414 - 419, April 1993.
[42] G. C. Cardarilli, et al., “Low Voltage Swing Circuits for Low Dissipation Buses,”
in Proc. of Int. Symp. on Circuits and Systems, June 1997, pp.1868-1871.
[43] M. Hiraki, et al., “Data-Dependent Logic Swing Internal Bus Architecture for
Ultralow-Power LSI’s, ” IEEE J. of Solid-State Circuits, vol. 30, pp. 397 - 402, April
1995.
[44] S.-J. Lee, et al., “Packet-switched on-chip interconnection network for system-
on-chip applications,” IEEE Trans. Circuits and Systems II, vol. 52, pp.308-312, June
2005.
[45] S.-J. Lee, et al., “Adaptive network-on-chip with wave-front train serialization
scheme,” in IEEE Symp. on VLSI Circuits Dig. Tech. Papers, June 2005, pp. 104-107.
[46] K. Lee, et al., “Low-Power Network-on-Chip for High-Performance SoC
Design,” IEEE Trans. VLSI Systems, accepted for publication.
[47] K. Lee, et al., “A High-Speed and Lightweight On-Chip Crossbar Scheduler for
On-Chip Interconnection Networks,” in Proc. of IEEE European Solid-State Circuits
Conf., Sept. 2003, pp.453-456.
[48] K. Lee, et al., “A distributed crossbar switch scheduler for on-chip networks,” in
Proc. of IEEE Custom Integrated Circuits Conf., Sept. 2003, pp.671-674.
[49] M. R. Stan, et al., “Bus-Invert Coding for Low-Power I/O,” IEEE Trans. VLSI
systems, vol. 3, pp. 49-58, March 1995.
[50] H. Mehta, et al., “Some Issues in Gray Code Addressing,” in Proc. of Great Lakes
Symp. on VLSI, Mar. 1996, pp.178-181.
[51] L. Benini, et al., “Asymptotic zero-transition activity encoding for address busses
in low-power microprocessor-based systems,” in Proc. of Great Lakes Symp. on VLSI,
March 1997, pp.77-82.
[52] Y. Shin, et al., “Partial Bus-Invert Coding for Power Optimization of System
Level Bus,” in Proc. of Int. Symp. on Low Power Electronics and Design, Aug. 1998,
pp.127-129.
[53] S. Ramprasad, et al., “A Coding Framework for Low-Power Address and Data
Busses,” IEEE Trans. VLSI systems, vol. 7, pp. 212-221, June 1999.
[54] C. Kretzschmar, et al., “Why Transition Coding for Power Minimization of on-
Chip Buses does not work,” in Proc. of the Design Automation and Test Europe Conf.
(DATE), February 2004, pp.512-517.
[55] Y. Shin, et al., “Narrow Bus Encoding for Low-Power DSP Systems,” IEEE Trans.
VLSI systems, vol. 9, pp. 656-660, Oct. 2001.
[56] H. Zhang, et al., “A 1V Heterogeneous Reconfigurable Processor IC for
Baseband Wireless Applications,” IEEE Int. Solid-State Circuits Conf., Feb. 2000, pp.
68-69.
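Acknowledgments
Although only my name appears on the cover of this dissertation, many people helped me during the past six years of graduate school on the way to this doctoral degree. I would like to thank all of them. First, I sincerely thank Professor Hoi-Jun Yoo, who, over the considerable period of two and a half years of my master's course and three and a half years of my doctoral course, showed unchanging passion, set a personal example of a proper researcher, and never spared sharp and insightful advice. I am also deeply grateful to Professor Kyu-Ho Park, Professor Joung-Ho Kim, Professor Young-Soo Shin, and Professor Hyuk-Jae Lee, who read my imperfect dissertation carefully and gave advice despite their busy schedules. I also want to express my deep gratitude to the SSL Family, which has always thought hard about its past, present, and future while aiming to be the best laboratory during these six years. First, to my seniors who have gone out into the world and are pursuing their own high dreams (Se-Jeong, Yong-Ha, Chi-Won, Ju-Ho, Jin-Ho, Sun-Ho, Ram-Chan, Se-Joong, Jeong-Hoon, Jae-Won) and to my classmates and juniors (Jae-Seo, Sung-Eun, Jin-Kyung, Min-Wook): every experience and conversation we shared, in whatever form, was a positive and constructive stimulus for me. I would also like to thank Professor Sung-Min Park, Dr. Jae-Youl Lee, and Principal Engineer Ji-Sun Seo, who shared useful advice and experience, even if only for a short time. I also send my deep thanks to all the current lab members who are putting their heads together and pushing SSL forward (Byeong-Gyu, (Sohn) Ju-Ho, Sung-Dae, Sung-Joon, Kyo-Min, Jeong-Ho, Dong-Hyun, Sun-Young, Nam-Jun, Kwan-Ho, Hye-Jung, Dam-I, Ju-Young). I also thank Eun-Soo Hong, who handled many difficult chores so well, and Ga-Won Kim, who put in great effort with me to fabricate the Network-in-Package. Finally, I dedicate this dissertation to my mother, who gives me great love and always worries about me, and to my father, who watches over me from heaven.
SEE YOU AT THE TOP!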
KANGMIN LEE kangmin@eeinfo.kaist.ac.kr
http://ssl.kaist.ac.kr/~kangmin
EDUCATION
Korea Advanced Institute of Science and Technology (KAIST) - Full Scholarship from Korea Government
9/02 - 2/06 Ph.D. in Electrical Engineering Dissertation: Design and Implementation of Low-Power Network-on-Chip for Application to High-Performance System-on-Chip Design
3/00 - 8/02 M.S. in Electrical Engineering Dissertation: Design and Implementation of a 80Gbps Shared Bus Packet Switch using Embedded DRAM
3/96 - 2/00 B.S. in Electrical Engineering - Magna Cum Laude Overall GPA: 3.84/4.30 - Major GPA: 3.80/4.3
University of California Berkeley, CA, US
6/00 - 8/00 Visiting student in Computer Science
WORK EXPERIENCE
Korea Advanced Institute of Science and Technology (KAIST)
3/00 - Present Research Assistant - Perform research mainly focusing on various circuits, architectures, protocols, algorithms and systems design and chip implementation. Major research areas include switches and on-chip interconnection networks.
3/00 - Present Teaching Assistant - Assist teaching for Electronic Laboratories, ASIC and Computer architecture courses
Micrel Lab. (Prof. Luca Benini), University of Bologna, Italy
4/05 - 5/05 Research Assistant - Performed cross-benchmarking of the AMBA Multilayer Bus and the Xpipes Network-on-Chip in terms of latency, area, and power consumption.
SAMSUNG Electronics, Ki-Heong, Korea
1/99 - 2/99 Winter Intern - Intern in Liquid Crystal Display division R&D team.
Research Assistant - Designed and implemented an OSD (On Screen Display) Controller chip for TV by using Mentor CAD System.
RESEARCH PROJECTS
BONE (Basic On-Chip Network) Development of On-Chip Interconnection Packet Switch Network for a SoC
5/04 - Present Leading and managing an On-Chip Network Team
2/03 - 3/04 Responsible for full chip architecture and design of a multimedia application SoC with low-power on-chip networks [ISSCC2005]
7/02 - 12/02 Responsible for high-speed and light-weight crossbar scheduling algorithm and its implementation
HOB (Hierarchical Output Buffer) Development of 10Gb-Ethernet 8x8 shared-bus switch fabric with Embedded DRAM
6/01 - 6/02 Responsible for full chip architecture, design and layout including memory and logic
RAMP (RAM Processor) Development of Application Specific Embedded Memory Logic Design Technology
2/00 - 6/00 SRAM Cell Layout, Calibre DRC/LVS Rule File Creation
POPeye (Probe Of Performance) Development of a Simulator for a DRAM Architecture Performance Evaluation on variable System Configuration with Real-life Applications for Windows Platform.
8/00 - 12/00 Responsible for Performance Evaluation and Analysis of a DDR-SDRAM, a Direct-Rambus DRAM, and DDR-Fast-Cycle-RAM.
INTERNATIONAL JOURNAL PAPERS (2 FIRST AUTHORED)
TVLSI 2006
Low-Power Network-on-Chip for High-Performance SoC Design Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. IEEE Transactions on VLSI Systems (Accepted for publication)
D&T Magazine 2005
Analysis and Implementation of Practical Cost-Effective Network-on-chips Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Design & Test Computers Magazine, Sept-Oct. 2005
TCAS-II 2005
Packet-Switched On-Chip Interconnection Network for System-on-Chip Applications Se-Joong Lee, Kangmin Lee, Seong-Jun Song and Hoi-Jun Yoo IEEE Transactions on Circuits and Systems II, Vol. 52, No. 6, June 2005
JSSC 2002
A Reconfigurable Multilevel Parallel Texture Cache Memory with 75-GB/s Parallel Cache Replacement Bandwidth Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Young Jung and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits Vol. 37, No. 5, May 2002
JSSC 2001
An 80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits Vol. 36, No. 11, November 2001
Journals 2001
POPeye: A Simulator for a DRAM Performance Evaluation Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, and Hoi-Jun Yoo IEEK Journal of Semiconductor Technology and Science, Vol. 1, No. 2, June 2001
INTERNATIONAL CONFERENCE PAPERS (9 FIRST AUTHORED)
A-SSCC 2005 Outstanding Design Award
Networks-on-Chip and Networks-in-Package for High-Performance SoC Platforms Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim, and Hoi-Jun Yoo. IEEE Asian Solid-State Circuits Conference (Outstanding Design Award) 2005
ISCAS 2005
An Arbitration Look-Ahead Scheme for Reducing End-to-End Latency in Networks-on-Chip Kwanho Kim, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo. IEEE International Symposium on Circuits and Systems 2005
ISCAS 2005
A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on-Chip Donghyun Kim, Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo. IEEE International Symposium on Circuits and Systems 2005
ICCAD 2004
SILENT: Serialized Low Energy Transmission Coding for On-Chip Interconnection Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International Conference on Computer Aided Design 2004
SOCC 2004
Low Energy Transmission Coding for On-Chip Serial Communications Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International SOC Conference 2004
SOCC 2004
On-Chip Network Based Embedded Core Testing Jong-Sun Kim, Min-Su Hwang, Seungsu Roh, Ja-Young Lee, Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE International SOC Conference 2004
ISSCC 2004
A 51mW 1.6GHz On-Chip Network for Low-Power Heterogeneous SoC Platform
Kangmin Lee, Se-Joong Lee, Sung-Eun Kim, Hye-Mi Choi, Donghyun Kim, Sunyoung Kim, Min-Wuk Lee and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2004
CICC 2003
A Distributed On-Chip Crossbar Switch Scheduler for On-Chip Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo
IEEE Custom Integrated Circuits Conference 2003
ESSCIRC 2003
A High-Speed and Lightweight On-Chip Crossbar Switch Scheduler for On-Chip Interconnection Networks Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE European Solid State Circuits Conference 2003
ESSCIRC 2003
A 10Gbps/port 8x8 Shared Bus Switch with embedded DRAM Hierarchical Output Buffer Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE European Solid State Circuits Conference 2003
ISSCC 2003
An 800MHz Star-Connected On-Chip Network for Application to Systems on a Chip Se-Joong Lee, Seong-Jun Song, Kangmin Lee, Jeong-Ho Woo, Sung-Eun Kim, Byeong-Gyu Nam, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2003
GLOBECOM 2002
A Practical Method to use eDRAM in the Shared Bus Switch Kangmin Lee, Se-Joong Lee, and Hoi-Jun Yoo IEEE Global Telecommunications Conference 2002
ISCAS 2001
A Comparative Analysis of a DDR-SDRAM, a D-RDRAM and a DDR-FCRAM using a POPeye Simulator Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Ja-Il Ku, and Hoi-Jun Yoo IEEE International Symposium on Circuits and Systems 2002
VLSI 2001
120mW Embedded 3D Graphics Rendering Engine with 64Mb Logically Local Frame Buffer and 3.2GByte/s Run-time Reconfigurable Bus for PDA-chip Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Yong-Ha Park, and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001
VLSI 2001
Low Power Motion Compensation Block IP with embedded DRAM Macro for Portable Multimedia Applications Chi-Weon Yoon, Jeonghoon Kook, Ramchan Woo, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001
VLSI 2001
A Reconfigurable Multilevel Parallel Graphics Cache Memory with 75-GB/s Parallel Cache Replacement Bandwidth Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Young Jung and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits 2001
ISSCC 2001
80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications
Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference 2001
DOMESTIC PAPERS
Conferences 2001
POPeye: A Simulator for a DRAM Performance Evaluation Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, and Hoi-Jun Yoo Korea Conference on Semiconductors 2001
PATENTS
1. Low Power Crossbar Switch Fabric Kangmin Lee and Hoi-Jun Yoo Korea Patent 10-2004-17745 (pending)
2. Serial Data Transmitter-Receiver And Method Thereof Kangmin Lee and Hoi-Jun Yoo Korea Patent 10-2004-31840 (pending)
AWARDS
1. Outstanding Award at A-SSCC Student Design Contest 2005 Kangmin Lee, Se-Joong Lee, Donghyun Kim, Kwanho Kim, Gawon Kim, Joungho Kim,
Hoi-Jun Yoo Networks-on-Chip and Networks-in-Package for High-Performance SoC Platforms
2. The Silver Prize at 5th National IC Design Contest 2004 Kangmin Lee, Se-Joong Lee, Donghyun Kim
Design and Implementation of Multimedia SoC using High-Performance On-Chip Network
3. The Best Design Award from Korea Prime-Minister at 3rd National IC Design Contest 2002
Kangmin Lee, Jaeseo Lee Design and Implementation of an 80Gbps Shared-bus Switch with eDRAM
PROFESSIONAL ACTIVITIES
1. Member of Technical Program Committees: DATE 2006 (http://www.date-conference.com/)
2. International Invited Seminars
(1) Dagstuhl Seminar: Power-Aware Computing Systems (http://www.dagstuhl.de/)
(2) IMEC Regular Seminar: Network-on-Chip and Network-in-Package (http://www.imec.be)
RESEARCH INTEREST
1. High-Speed and Low-Power On-Chip Interconnection Network Architecture and its Silicon Design for SoC Platform
2. Gigabit Network Switch Design with Embedded DRAM Technology
SKILLFUL TOOLS
High-level Simulation: C/C++, SystemC
Logic Design: Verilog HDL, Synopsys Design Compiler, Astro P&R Tools
Circuit Design: Cadence Opus, Hspice, Synopsys nanosim
Layout Art: Cadence Opus, SKILL, Calibre
Workstation: UNIX (Solaris OS)
On-chip Interconnection Protocols: BONE, AMBA (AHB and AXI), IBM CoreConnect, OCP-IP
LANGUAGE
Korean as a mother tongue, Proficient English, Beginning Japanese, Chinese and Italian