
© Jun Dong Cho, 2007.7 1

Mobile SoC Design Automation Lab. Sungkyunkwan Univ. S. Korea

Multi-processor System on Chip Design

Jun Dong Cho, Sungkyunkwan University

Contents

• Requirements for next-generation SoC (System on Chip)
• The need for MP-SoC
• History of Multiprocessors
• MP-SoC Examples and Applications
• Homogeneity and Heterogeneity
• MP-SoC Design Automation
• Network on Chip
• SKKU’s Mobile MP-SoC Platform

Requirements for Next-Generation SoC (System on Chip)

What is System on Chip?

SoC building blocks:

• Processor: AP - MC
• Modem: GSM/GPRS - WCDMA - CDMA2000
• Connectivity: Wireless LAN - GPS - Bluetooth
• RF/Analog: Rx - Tx - Zero IF - PM
• Camera Chipset: CIS - CCD - ISP
• Display Driver IC (DDI): STN - TFT - OLED
• Smart Card: SIM
• Flash Memory: Code/Data Storage
• RAM: Mobile DRAM - SRAM - UtRAM
• SIP / MCP

The Need for High Performance and Low Power

[Figure: growth trends from 1995 through 2003 to 2012. Curves: Moore’s law; Shannon’s law (2.8x / 18 months); battery capacity; design complexity. Data rates: 2G (IS-95) 9.6 kbps, 3G (CDMA 1xEV) 3,100 kbps, 4G (100 Mbps~1 Gbps). Mobile multimedia: QVGA, D1, HD (720p), Full HD (1080i), 3D graphics.]

• Productivity gap: design complexity vs. Moore’s law
• Power gap: design complexity vs. battery

Flexibility-Energy Gap

[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility. Marked regions: sensor network design space; wireless embedded systems design space.]

• Embedded processor (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• FPFA (Field Programmable Function Array): 10-80 MOPS/mW
• Signal-processing ASIC: 200 MOPS/mW

Five Requirements for Raising Next-Generation SoC Productivity

1. High Performance
2. Fast Verification
3. Small Form Factor
4. Low Power Solutions
5. Design-Technology Integration for Manufacturability

1. High-Performance: CMP + NoC

Heterogeneous Chip Multi-processor Architecture

[Figure: μP, IP, Mem, and PE blocks connected through a NoC.]

[Figure: Technology evolution. Number of PEs (0 to 400) per chip for 2004, 2007, 2010, 2013, 2016. Source: ITRS 2005 draft.]

2. Fast Verification: Embedded System Level

Complexity growth:

• Moore’s Law: 2x / 18 months
• Nielsen’s Law: 2x / 12 months
• Embedded SW: 2x / 10 months

Design levels and languages:

• System specification: UML / Java / MatLab
• Architecture design: SystemC / ADL
• RTL design: Verilog / VHDL

[Figure: ESL and TLM modeling. Signals shown: ctrl1, cmd1, Req, Addr, Grant, Data, ack0, ack1.]
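The doubling periods quoted on this slide compound quickly. The following is a minimal sketch, assuming pure exponential growth; the helper names `growth_factor` and `gap_after` are hypothetical and not from the lecture. It shows how far each trend pulls ahead of Moore's law over a few years.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Growth multiple after elapsed_months for a quantity that doubles every doubling_months."""
    return 2.0 ** (elapsed_months / doubling_months)


# Doubling periods in months, as quoted on the slide.
DOUBLING = {"Moore": 18, "Nielsen": 12, "Embedded SW": 10}


def gap_after(years: float) -> dict:
    """Growth of each trend divided by the growth of Moore's law after `years`."""
    months = 12 * years
    moore = growth_factor(DOUBLING["Moore"], months)
    return {name: growth_factor(d, months) / moore for name, d in DOUBLING.items()}


if __name__ == "__main__":
    for name, ratio in gap_after(5).items():
        print(f"{name}: {ratio:.1f}x relative to Moore's law after 5 years")
```

Under these assumptions the ratio itself grows exponentially, which is one way to read the slide's argument for moving verification up to the system level.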

3. Small Form Factor

SiP: Mobile Application Processor + Mobile Memory

[Figure: a ~35mm x ~25mm board carrying a Mobile AP, 32MB NAND, and two 16MB SDRAMs, compared with a 17mm x 17mm package stacking Flash, SDRAM, SDRAM, and the Mobile AP.]

• 60% smaller area
• EMI reduction
▷ 8 layers of MCP
▷ Cost reduction by 15%

4. Low Power Solutions

Design levels (DAC 2004): Device, Circuit, Architecture, Runtime

Techniques:

• MTCMOS
• Clock Gating
• Multi-Vdd
• Transistor sizing
• VTCMOS
• Multi-Vt
• SOI
• High-κ Metal Gate
• Parallelization
• GALS
• DVFS
• DPM/DVS

[Figure: Multi-Vdd / DVFS operating points: 1.5V at 500MHz, 1.2V at 350MHz, 1.0V at 200MHz.]

[Figure: VTCMOS body biasing (VBP, VBN, VDD, VSS) and MTCMOS, each shown in active and standby modes.]
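The three voltage/frequency pairs on this slide can be compared with the usual first-order model of dynamic power, P ∝ C·Vdd²·f. The sketch below assumes that model, a constant switched capacitance, and no leakage; the function name and the choice of 1.5V/500MHz as reference are illustrative.

```python
def relative_dynamic_power(vdd: float, freq_mhz: float,
                           ref_vdd: float = 1.5, ref_freq_mhz: float = 500.0) -> float:
    """Dynamic power relative to the reference point, using P ~ C * Vdd^2 * f."""
    return (vdd / ref_vdd) ** 2 * (freq_mhz / ref_freq_mhz)


# (Vdd in volts, clock in MHz) pairs taken from the slide.
OPERATING_POINTS = [(1.5, 500.0), (1.2, 350.0), (1.0, 200.0)]

if __name__ == "__main__":
    for vdd, f in OPERATING_POINTS:
        p = relative_dynamic_power(vdd, f)
        # Energy per cycle scales with Vdd^2 only, since the frequency cancels out.
        e = (vdd / 1.5) ** 2
        print(f"{vdd:.1f} V @ {f:.0f} MHz: power x{p:.3f}, energy/cycle x{e:.3f}")
```

Under this model, the 1.0V point runs at 40% of the top clock but draws under a fifth of the dynamic power, which is the trade that DVFS exploits.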

5. Design-Technology Integration for Manufacturability (DfM)

[Figure: variation information and the designer’s intention (critical timing, power) exchanged across the design levels, linked by statistical analysis.]

• Design levels: Quantum Physics; Mask / Process Design; Logic / Physical Design; Architecture Design; Algorithm Design
• Variation information: NA, Tox; Vt, Lg, L, t, tILD; Vdd, Temp; Latency, Power; Fault Probability
• Techniques: Statistical STA; Yield-improving architecture; Fault-tolerant algorithm

More SoC topics …

• Platform optimization
  – Power management
  – BW allocation
  – Resource sharing
  – Task distribution
  – Efficient communications
• Low Power
• Verification
• Training talent (System Architects)

The Need for MP-SoC (Multi-Processor System on Chip)

Definition of MP-SOC?

Usually a heterogeneous multiprocessor:

• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Definition of Multiprocessor by Enslow Jr.: MIMD machines with shared memory

• Shared memory
• Shared I/O
• Distributed OS
• Homogeneous

Extended definition: all parallel machines (wrong usage)


Future Microprocessors

Why MP?

• Uniprocessors have hit the ceiling
• Get performance from better architecture instead of more MHz


Anatomy of a Cellular Phone

3G Wireless Protocols

MP-SoC Application Areas: 4G: Multiple standards

• Communications.
• Networking.
• Multimedia.
• Security.

Digital RF supporting multiband/multimode

Evolution of the MP-SoC Platform (WCDMA + CDMA2000 example)

System Architecture for 3G

• 4 PEs
  – static kernel mapping and scheduling
  – SIMD + scalar units
• 1 ARM GPP controller
  – scalar algorithms and protocol controls

ARM MPCore™ Architecture

Reconfigurable and Scalable MP-SoC Platform

Road Map to MP-SoC Trends

• Mask NRE: over $1M
• Design NRE: $10M to $75M
  – ASICs replaced by programmable ASSPs and FPGAs
• Number of embedded processors
  – DVD/STB/HDTV, mobile phones: 5 to 8
  – Image processing, networking, basestation: 8 to 100+
• E-S/W complexity
  – Set-top box, audio: >1 million lines of code

E-S/W is becoming an essential part of SoCs

Who’s Law?


Why is MP-SOC Challenging?

Software Defined Wireless Multimedia Terminals

• Lower costs
  – Platform longevity, higher volume
  – SW has lower development costs
• Time to market
  – Future protocols will have complex implementations
  – Overlap testing/development cycles
• Adaptability
  – Standards change over time
  – Multi-mode operation
  – Sharing hardware resources

Multistandard Radio

• UMTS
• GSM/GPRS/EDGE
• WLAN
• Bluetooth
• UWB

Multistandard M/M

• H.264
• MP3
• AAC
• GPS
• DVB-H
• TPEG

SDR = Reconfigurable Radios

SDR Configuration

Soft Radio Digital Signal Processing Engine

• Modulation Format
  – QPSK
  – DQPSK
  – π/4 DQPSK
  – {16,64,256,1024} QAM
  – OFDM
  – OFDM CDMA
• Digital Down/Up Conversion (DDC)
  – Channel Center
  – Decimation/Interpolation rates
  – Compensation Filters
  – Matched Filter α = {0.25,0.35,...}
• FEC
  – Convolutional
  – Reed-Solomon
  – Concatenated Coding
  – Turbo CC/PC
  – (De-)Interleave
• Network Interface Definition
• Channel Access
  – CDMA
  – TDMA
• Security
• Beam Forming
• DSSS
  – Rake, track, acquire
  – Multi User Detect. (MUD)
  – ICU

Future mobile applications?

Mudge et al.:

• Mobile supercomputing
  – Speech recognition.
  – Cryptography.
  – Augmented reality.
  – Typical applications (email, etc.).
• Requires 16x 2 GHz Pentium 4?

Culture and Education? Personal Entertainment?

MP-SoC Application Areas

[Figure: application map.]

• Categories: Broadcasting, Ubiquitous; Health, Human, Bio; Automotive & Robotics
• Examples: D-TV, CIS, Mobile, Recorder, Health, HCI, Bio, Data Broadcasting, RFID, Telematics, Unmanned Driving, Robot

MPSoC today

• High performance, low power: there is no other way than MPSoC!
• Virtually all processor vendors are on the MPSoC route
  – TI: OMAP, DaVinci
  – STM: Nomadik
  – IBM: Cell
  – Intel: IXP, CoreDuo
  – Philips: Nexperia
  – Atmel: Diopsis
  – ARM: MPCore
  – ARC: VRaptor
• Urgent need for MPSoC design tools
  – Application design and platform capture
  – Architecture exploration and optimization
  – Simulation and verification
  – Application to architecture mapping

The triangle, Chicken and Egg?

[Figure: triangle of architectures, applications, and methodologies.]

• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.

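Should the SoC designer work hard?

[Figure: design flow from Requirements to Tape Out, with a Verify step after every stage.]

• Requirements
• Simulate (performance), then verify
• Synthesis + P&R, then verify (timing, area)
• SoC Composer, then verify
• Simulate, then verify
• Compose the system, then verify
• Tape Out

Why is verification so important in mobile SoC?

Why has our verification become so weak?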

Some statements from MPSoC 2006 Symposium

• “The ad-hoc approach to SoC design cannot scale with Moore’s Law... The ‘SW development environment as afterthought’ era of IC design is rapidly drawing to a close” (K. Keutzer, UCB)

• “Power-constrained CPUs are mandatory, but the most exciting features require system-level SW optimization” (M. Kuulusa, Nokia)

• “Multi-core platforms are a reality – but where is the SW support?” (R. Lauwereins, IMEC)

Gartner Announces Top 10 Technologies for 2007

• Gartner announced the 10 technologies expected to reach maturity within the next three years (25th Annual Gartner Data Center Conference, 2006.11.28~12.1)

• Open Source, Virtualization, Information Access, Ubiquitous Computing, Grid Computing, Compute Utilities, Multicore Processors, Web 2.0, Network Convergence, Water Cooling

http://www.gartner.com/2_events/conferences/lsc25.jsp


Fundamental to Parallel Machines

Purposes of multiple processors

• Performance
  – A job can be executed quickly with multiple processors
• Fault tolerance
  – If a processing unit is damaged, the total system can remain available: redundant systems
• Resource sharing
  – Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
  – High performance with low-frequency operation

Why Multi-Threaded Cores?

[Figure: a NoC with In and Out ports connecting DSPs, a H/W multi-threaded RISC, H/W processing elements, a GPP with I$ and D$, and SRAM.]

• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock)
• More parallel processing (lower power, higher perf./mm²)

Flynn’s Classification

• The number of instruction streams: M (Multiple) / S (Single)
• The number of data streams: M/S
  – SISD: uniprocessors (including superscalar and VLIW)
  – MISD: does not exist (analog computer)
  – SIMD
  – MIMD
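Flynn's two axes map directly onto a four-way label. A minimal sketch, with the hypothetical helper `flynn_class` taking the number of concurrent instruction and data streams:

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn label (SISD, SIMD, MISD, MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    i = "S" if instruction_streams == 1 else "M"
    d = "S" if data_streams == 1 else "M"
    return f"{i}I{d}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))   # a uniprocessor
    print(flynn_class(1, 16))  # one instruction stream driving 16 data lanes
    print(flynn_class(4, 4))   # four independent processors
```

Note that a superscalar or VLIW uniprocessor still has one instruction stream, so it stays in SISD, as the slide states.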

MIMD

[Figure: processors connected to memory modules (instructions and data) through interconnection networks.]

• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible

Classification of MIMD machines

• UMA (Uniform Memory Access Model) provides shared memory that all processors access in the same manner.

• NUMA (Non-Uniform Memory Access Model) provides shared memory, but access is not uniform.

• NORA/NORMA (No Remote Memory Access Model) provides no shared memory. Communication is done with message passing.

An example of UMA: Bus connected

[Figure: four PUs, each with a snoop cache, connected to main memory over a shared bus.]

• SMP (Symmetric MultiProcessor)
• On-chip multiprocessor

Switch connected UMA

[Figure: CPUs with local memory, each attached through an interface to a switch that connects them to the main memory modules.]

NUMA

• Each processor has a local memory and accesses other processors’ memory through the network.

• Address translation and cache control often make the hardware structure complicated.

• Scalable:
  – Programs for UMA can run without modification.
  – Performance improves as the system size grows.

Competitive with WS/PC clusters using software DSM

Typical structure of NUMA

[Figure: Node 0 to Node 3 connected by an interconnection network; the nodes’ memories together form one logical address space starting at 0.]

Classification of NUMA

• Simple NUMA:
  – Remote memory is not cached.
  – Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
  – Cache consistency is maintained in hardware.
  – The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
  – No home memory
  – Complicated control mechanism

Cray’s T3D: A simple NUMA supercomputer (1993)

• Using Alpha 21064

The Earth Simulator (2002)

NORA/NORMA

• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance

The fastest processor is always NORA (except the Earth Simulator)

Hard to program

Inter-PU communications

Cluster computing

Early hypercube machine: nCUBE2


MP-SoC Examples and Applications


Dual-Core (DSP+ARM) Platform

IBM Power4

– 2 cores
– F = 1.4GHz
– Single clock over entire die
– Balanced H-tree driving global grid
– Measured clock skew below 25ps
– Power ~85W
– 180nm SOI process, 174M transistors

IBM’s Multiple processors on MCM

– 4 POWER4 chips in a single module (MCM)
– The POWER4 chips are connected via 4 128-bit buses
– Up to 128MB L3 cache
– Bus speed ½ processor speed
– Total throughput ~35 GB/s

MPSoC “Bus” Alternatives

• Fixed Bus [Bergamaschi, DAC, 2000]
  – Point to point communication
  – Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
  – Programmable interconnects
  – Employ static network
• Arbitrated Bus [IDT Inc., 2000]
  – Time-shared multiple core connectivity
  – Use arbiter
• Hierarchical Bus [AMBA, ARM Inc]
  – Combine multiple buses using bus
  – Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
  – Resources communicate with data packets
  – Use switch fabric

Available Mobile Processors

• The ARM Family
  – The ARM7 Generation
  – The StrongARM
  – The ARM Thumb Option
  – The ARM Piccolo Option
  – The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
  – The Motorola-Lucent Star*Core
  – The Philips TriMedia
  – The HP/Intel IA-64

Available MP-Cores

• TI OMAP
• Philips Nexperia™ DVP
• ST Nomadik
• Intel® Itanium® Montecito
• CELL Processor
• CT 3400 Multi-core DSP
• HiBRID-SoC
• Systolic Ring
• Virtual platform in SHAPES project

TI OMAP

• Targets communications, multimedia.
• Dual-processor (DSP, RISC) with shared memory
• Hierarchical definition of platform
• Critical role of software as well as hardware
• OCP (Open Core Protocol) based SoC platform

[Figure: OMAP 5910 block diagram: C55x DSP, ARM9, MMU, memory controller, MPU interface, system DMA control, bridge, I/O.]

Platform Layers and Classification

• Level 0: Foundation Platform
  – Infrastructure & standards: basic architecture
  – Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-Specific Integration Platform
  – Application-specific SoC: HW & SW
  – Mobile platform, home platform
• Level 2: System Platform
  – Terminal platform
  – Handset case: RF + Modem + AP + Memory + MMI

Hierarchy of Platforms in OMAP

[Figure: platform stack running from application specific to broadly applicable: Reference Design; Application Platform; SoC Platform; ASIC Library & Tools; Silicon Technology. Side labels: OMAP Products, OMAP Infrastructure, System Platform, Reuse.]


Scalable Multi-processors

TI OMAP 1510 Platform Architecture

[Figure: block diagram. TI925 and C55X cores, each with a peripheral bus; System DMA; Mail Box; DSP MMU; Traffic Controller with IMIF, EMIFF, and EMIFS; SRAM; LB MMU and HASB MMU. Buses: SDRAM bus (16), Flash bus (16), LB (32), HASB (32), peripheral buses (8/16/32). Peripherals: LCD controller, interrupt handlers, timers, GPIO, UARTs, McBSP.]

GPP: TI925 core
- 16KB I-Cache
- Write buffer
- MMU and D-MMU
- Dual TLB

DSP: C55x core
- Internal memory: 48KW SARAM, 32KW DARAM, 16KW PDRAM
- 24KB I-Cache
- Graphic HW accelerator
- ARM port interface

IPC
- Mail boxes
- API
- DSP MMU

Other blocks: System DMA, traffic controller, internal SRAM, busses, peripherals

TI OMAP 1510 Platform S/W

[Figure: software stack. On the TI925 general-purpose processor: OS kernel & drivers, media APIs, resource manager, node database, MCU bridge kernel, OS adapter, LINK driver. On the TMS320 DSP: DSP/BIOS kernel, LINK driver, other drivers, RM server, and XDAIS algorithms (MPEG4, MP3, AMR) encapsulated in socket nodes. Raw data streams: video, audio, speech.]

Philips Nexperia™ DVP

(source: Th. Claasen, Philips, DAC 2000)

Philips Nexperia™ DVP S/W Reference Architecture

[Figure: software reference architecture. Inputs: analog inputs through analog front ends; digital inputs through digital front ends; optical drive; hard disk; network protocols. A compressed A/V input bus feeds the players (Broadcast-MPEG2, VCD/SVCD, DVD, CD/SACD, WMT, RN, Broadcast-MPEG4, CD/DVD-MP3) and the recorders (DVD+RW Auth, PVR-SPTS, Lo-Rate SPTS). An uncompressed A/V input bus, transcoders, translators, a TS-SPTS filter, and a loopback / feedthrough path are also shown. Outputs: presentation engine (audio and video processing), digital outputs, and a protocol stack with network protocols and an HDD/Ethernet driver.]

Philips Nexperia™ DVP MP-SoC

• Philips’s advanced set-top box and digital TV SoC (Viper2)
• 0.13 μm
• 50 M transistors
• 100 clock domains
• > 60 IP blocks

ST Nomadik

• Targets mobile multimedia. A heterogeneous multiprocessor-of-multiprocessors.

Power Distribution: Intel Xeon Processor

Clock and Power Convergence

Dynamic voltage and frequency scaling (DVFS)

Intel® Itanium® Montecito - Clock system architecture

– Each core is split into 3 clock domains on a variable power supply

Intel® Itanium® Montecito - Power management

– Dynamic voltage-scaling power management system
– 4 on-die sensors
– On-die microcontroller
– Power and temperature measurement
– Voltage and frequency modulation
– 8μs power/temperature sampling interval
– Embedded firmware
– Power, temperature, or calibration measurements
– Power: closed-loop power control and system stability check
– Temperature: thermal sensor readout (junction temperature below 90°C monitoring) and power-control communication
– Calibration: power-measurement accuracy check


The implementation of a first-generation CELL Processor

The Cell Processor

• Fclock > 4 GHz.
• Memory bandwidth: 25.6 GBytes per second.
• I/O bandwidth: 76.8 GBytes per second.
• Performance:
  – 256 GFLOPS (single precision at 4 GHz).
  – 256 GOPS (integer at 4 GHz).
  – 25 GFLOPS (double precision at 4 GHz).
• 235 square mm.
• 235 million transistors.
• Power consumption estimated at 60 - 80 W @ 4GHz
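The single-precision peak on this slide can be reproduced from a simple product. The sketch below assumes the commonly described Cell organisation, which the slide itself does not give: 8 SIMD processing elements, 4 single-precision lanes each, and a fused multiply-add counted as 2 floating-point operations per lane per cycle. The function name is illustrative.

```python
def peak_gflops(units: int, simd_lanes: int,
                flops_per_lane_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFLOPS = units * lanes * flops per lane per cycle * clock (GHz)."""
    return units * simd_lanes * flops_per_lane_per_cycle * clock_ghz


if __name__ == "__main__":
    # Assumed organisation (not stated on the slide): 8 units, 4 lanes, FMA = 2 flops.
    peak = peak_gflops(units=8, simd_lanes=4, flops_per_lane_per_cycle=2, clock_ghz=4.0)
    mem_bw_gbytes = 25.6  # memory bandwidth from the slide
    print(f"peak: {peak:.0f} GFLOPS, memory bytes per flop: {mem_bw_gbytes / peak:.2f}")
```

With these assumptions only about 0.1 byte of memory traffic is available per peak flop, so reaching the peak requires heavy reuse of data held on chip.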

Cell’s Element Interconnect Bus

• 4 rings (2 clockwise + 2 counter-clockwise)
• Not token rings: access still uses request/grant arbitration

CT 3400 Multi-core DSP

• Eight 32-bit DSP cores
• Six 32-bit general-purpose processor cores
• A 128-pin programmable I/O subsystem
• Programmable in C
• Supports H.264 and MPEG4 codecs

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 encoder , decoder and audio codecs and the system control

H.264 codec onto CT3400 MDSP

Source: Cradle

CT 3400 Multi-core DSP

CT3400 DSP Engine

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

Each DSP engine contains:

• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register file
• A 512 x 20 program memory

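CT3600 Multiprocessor DSP Family Members

• The CT3616 offers the industry’s best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product

• It implements the industry’s first single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs

• 0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple the overall performance

• Priced from $40 to $90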

http://www.cradle.com/downloads/CT3600-PB.pdf


CT 3616 Multi-core DSP

http://www.cradle.com/downloads/CT3600-PB.pdf

HiBRID-SoC Architecture

• Multi-core SoC architecture
• Dedicated chips for the MPEG-4 Simple Profile
• Integrates a powerful on-chip communication structure
• Three programmable cores, each adapted towards a specific class of algorithms
  – Instruction level: VLIW (very long instruction word)
  – Data level: SIMD (single instruction multiple data)
  – Task level: simultaneous multithreading
• Developed at the University of Hannover

Multi-Core SoC Architecture

• HiPAR-DSP
  – 16-datapath SIMD processor core controlled by VLIW
  – Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
• Stream Processor (SP)
  – 32-bit RISC architecture optimized towards control-dominated tasks
  – Bitstream processing or global system control
• Macroblock Processor (MP)
  – Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
  – Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps

HiBRID-SoC multi-core architecture

• A 64-bit AMBA AHB system bus connects all cores
• SDRAM memory via a 64-bit SDRAM interface
• Two versatile 32-bit host interfaces for access (e.g., host PC via PCI and to serial flash memory)

HiPAR-DSP

• Highly parallel DSP core with a VLIW-controlled SIMD architecture
• A DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
• At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
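The 2.3 GMACs figure follows from the 16 SIMD datapaths and the 145 MHz clock quoted in these slides, assuming each datapath completes one multiply-accumulate per cycle. That per-cycle rate is an assumption and is not stated on the slides; the function name is illustrative.

```python
def peak_gmacs(datapaths: int, clock_mhz: float,
               macs_per_datapath_per_cycle: int = 1) -> float:
    """Peak multiply-accumulate rate in GMAC/s for a SIMD array."""
    return datapaths * macs_per_datapath_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    # 16 datapaths and 145 MHz are from the slides; 1 MAC per cycle is assumed.
    print(f"{peak_gmacs(16, 145.0):.2f} GMAC/s")
```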

Macroblock processor

• Heterogeneous data path structure consisting of a scalar and a vector data path
• The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
• The vector data path is equipped with a 64-entry register file of 64-bit width
• Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
• MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
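Subword parallelism treats one 64-bit operand as several independent lanes. The sketch below emulates a lane-wise wrapping add for 8-, 16-, or 32-bit lanes. It is a software illustration of the idea only, not the Macroblock Processor's instruction set, and the names `pack`, `unpack`, and `subword_add` are hypothetical.

```python
def pack(values, width):
    """Pack 64 // width unsigned values into one 64-bit word, lane 0 in the low bits."""
    lanes = 64 // width
    if len(values) != lanes:
        raise ValueError(f"expected {lanes} values for {width}-bit lanes")
    mask = (1 << width) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * width)
    return word


def unpack(word, width):
    """Split a 64-bit word back into its lanes."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(64 // width)]


def subword_add(a, b, width):
    """Lane-wise add with wrap-around; carries never cross a lane boundary."""
    if width not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    mask = (1 << width) - 1
    result = 0
    for i in range(64 // width):
        shift = i * width
        lane = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        result |= lane << shift
    return result


if __name__ == "__main__":
    x = pack([1, 2, 3, 4], 16)
    y = pack([10, 20, 30, 40], 16)
    print(unpack(subword_add(x, y, 16), 16))
```

The key property is that an overflow in one lane wraps inside that lane instead of carrying into its neighbour, which is what separates a subword add from an ordinary 64-bit add.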

HiBRID-SoC Implementations

[Figure: chip layout of the HiBRID-SoC.]

[Figure: MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, 720x576 @ 25 Hz, 1.5-3 Mbit/s.]

HiBRID-SoC is fabricated in a 0.18 μm, 6LM standard-cell technology: 14 million transistors, 3.5 W, 82 mm², 145 MHz

New Taxonomy/Metric

• Flynn: triple (d, i, c)
  – d: # of data streams
  – i: # of instruction streams
  – c: # of configuration states
  – SISD, SIMD, MIMD, MISD
• RA: (c, g, a)
  – c: configurability to various environments
  – g: size of granularity
  – a: adaptability to various components
  – SCSG, SCMG, SCLG
  – MCSG, MCMG, MCLG

Systolic Ring

• Based on a coarse-grained configurable PE
• Circular datapaths
  – C: # of layers (C = 4)
  – N: # of Dnodes per layer (N = 2)
  – S: # of rings (S = 1)
• Control units (sequencers)
  – Local Dnode unit
  – Local Ring unit
  – Global unit

[Figure: a ring of four layers (layer 1 to layer 4), each with two Dnodes, separated by switches; a Dnode sequencer and a local ring sequencer.]


Motivation For Using Hierarchical Rings

• Relatively simple switching logic reduces the complexity at each node resulting in reduced buffer, area and energy requirements.

• Low latency since packets are forwarded in 1 clock cycle.

• Packets will always arrive in-order at the destination.

• Broadcast and Multicast packets are efficiently implemented.

• Hierarchical rings can be partitioned into independent clock domains.

Remanence

$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$

• N_PE: # of processing elements (PE)
• N_c: # of PEs configurable per cycle
• F_e: operating frequency
• F_c: configuration frequency

Characterizes the dynamism:

• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations

[Figure: architecture model. A sequencer reads inst0 … instn from the configuration memory and drives the sequencing unit, the routing, and the interconnection of the processing elements (PEs).]

Operative Density

$$OD(N_{PE}) = \frac{N_{PE}}{A(N_{PE})}$$

• N_PE: # of PEs
• A: core area (relative unit λ²)

Area can be expressed as a function of N_PE

[Figure: architecture model with sequencer, configuration memory, sequencing unit, routing, interconnection, and processing elements (PEs).]

Remanence formalisation

• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1

$$R(N_{PE}) = k \cdot N_{PE}, \qquad k = C/N$$

[Figure: remanence (0 to 40) versus # of Dnodes (0 to 180), with curves for k = 1, 2, 4, 8, next to an 8-layer Systolic Ring (layer 1 to layer 8) built from Dnodes and switches.]

Architectural model Characterization

[Figure: four Systolic Rings of Dnodes and switches, each with a local ring sequencer, connected by a global bus to a global sequencer.]

• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)

Control units

• Local Dnode unit
• Local Ring unit
• Global unit

www.qstech.com

Best OD and remanence

[Figure: design space. Operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140), with curves for S = 1, 2, 4, 8.]

Worst interconnect resources and processing power

Worst OD and remanence

[Figure: design space. Operative density (0.000 to 0.040) and remanence (0 to 20) versus # of Dnodes (0 to 140), with curves for S = 1, 2, 4, 8.]

Best interconnect resources and processing power

Comparisons of RA

1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (≤8)
3. Many cycles to (re)configure fine grain RA

Name            Type              NPE    Nc     F (MHz)   R
ARDOISE         Fine Grain RA     2304   0.14   33        16457
Systolic Ring   Coarse Grain RA   24     4      200       6
DART            Coarse Grain RA   24     4      130       6
MorphoSys       Coarse Grain RA   128    16     100       8
TMS320C62       DSP VLIW          8      8      300       1

$$R = \frac{N_{PE} \cdot F_e}{N_c \cdot F_c}$$

Source: Pascal BENOIT
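The R column of this comparison can be checked against the remanence definition R = N_PE·F_e / (N_c·F_c). The sketch below assumes F_e = F_c for every entry, because the table lists a single frequency F per architecture; under that assumption R reduces to N_PE / N_c. The function and variable names are illustrative.

```python
def remanence(n_pe: float, n_c: float, f_e: float, f_c: float) -> float:
    """Remanence R = (N_PE * F_e) / (N_c * F_c)."""
    return (n_pe * f_e) / (n_c * f_c)


# (N_PE, N_c, F in MHz) as listed in the comparison table.
ARCHITECTURES = {
    "ARDOISE": (2304, 0.14, 33),
    "Systolic Ring": (24, 4, 200),
    "DART": (24, 4, 130),
    "MorphoSys": (128, 16, 100),
    "TMS320C62": (8, 8, 300),
}

if __name__ == "__main__":
    for name, (n_pe, n_c, f) in ARCHITECTURES.items():
        # Assumption: the operating and configuration frequencies are both F.
        print(f"{name}: R = {remanence(n_pe, n_c, f, f):.0f}")
```

The values computed this way match the table: the fine-grain architecture needs thousands of cycles to reconfigure completely, while the DSP reconfigures every cycle (R = 1).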


Virtual platform in SHAPES project


Homogeneity and Heterogeneity


MPSoC Architecture Trends

Exploitable Parallelism

[Figure: exploitable task parallelism versus minimum parallel grain size (instructions). Regions: instruction-level parallelism, MultiFlex thread-level parallelism, GP O/S thread-level parallelism. Grain sizes shown: 1, 100’s, and 10 000’s of instructions. Parallelism ranges shown: 1~8, 2~6, 1~100.]


Three Levels of Parallelism

Parallel Heterogeneous Platforms (PHPs)

• Challenges:
  – Explore the theoretically high performance

Platform       Company            PEs   Het?
Cell           IBM/Sony/Toshiba   9     Y
DRP            NEC                512   N
Nomadik        ST                 3+    Y
OMAP2420       TI                 4     Y
Nexperia       Philips            3+    Y
X-Fi           Creative           7     Y
ARM11 MPCore   ARM                1-4   N
IXP2800        Intel              17    Y
MXP5800        Intel              54    Y
…              …                  …     …

(From Abhijit Davare’s Quals Presentation)

Homogeneous MP-SOC

• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA Bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC

NEC MP211: Homogeneous MP core

• Asymmetric MP with very coarse grain multitasking

• 3 ARM9’s utilized as predefined function units

• No complex overhead: e.g. no cache coherency, dynamic scheduling/load balancing


MP211 block diagram

Power consumption of H.264+AAC

H.264 video decoder (QVGA 15fps) and MPEG2 AAC decoder (48K stereo, 128kbps) for DTV: 87mW (excluding I/O, SDRAM), 124mW (including I/O, SDRAM). The L0 region represents the baseline power consumption, and the L1 region represents the periods in which power-hungry IPs are running.

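Problems with Homogeneous MP

▷ Under a power constraint, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power grows linearly.
▷ The interconnect depends not only on wires but also on logic.
▷ Processors are bound by wire and memory latencies.
▷ Peak performance is reached only for specific applications.
▷ The on-chip interconnect has been designed independently, separate from the cores and caches.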

Heterogeneous MP Core

If two or more cores share L2, as many present CMPs do, a crossbar provides a high-bandwidth connection.

A multi-ISA multicore architecture consists of processors with different ISAs and is designed to exploit vector/data-level parallelism and instruction-level parallelism at the same time.

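Heterogeneous MP core

• Easy hardware implementation: using widely available processor cores shortens hardware development time and lowers cost.

• Lower power consumption: the distributed tasks are handled by multiple processors running at a lower clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

• Boosting real-time performance: each application can run on a different processor, which reduces interference between applications.

• Higher system safety: system software and untrusted applications can be separated by running them on different processors.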

Heterogeneous MP Core

▷ Compared with a homogeneous CMP (Chip Multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large cores.

▷ A multi-ISA multicore architecture is designed to exploit vector/data-level parallelism and instruction-level parallelism at the same time. The number, size, and type of the cores, as well as the caches, must be decided. For an 8-core processor, the interconnect consumes as much power as one core.

▷ Taking a dual processor as an example, heterogeneous processors that serve both low and high thread-level parallelism improve performance by 63% over homogeneous ones. With 5-8 threads, the average improvement is 29%.

▷ Serial sections run quickly on a large core, and parallel sections run on small, low-power cores, which maximizes the performance-to-power ratio. [Annavaram, et al]

© 조준동, 2007년 여름 108

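NEC's Asymmetric (or Heterogeneous) Multiprocessing

• Using widely deployed processor cores shortens hardware development time and lowers cost.

• Lower power consumption: the multiprocessor handles the distributed tasks at a reduced clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be tuned by increasing or decreasing the number of processor cores.

• Boosted real-time performance: each application can run on a different processor, which reduces interference between applications.

• Higher system security: system software and untrusted applications can be isolated by running them on different processors.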

Problems of Heterogeneous MP-SoC

• Processors are bound by wire and memory latencies

• Peak performance on only a small class of applications

• How well they map to a given design

• Diversification of workloads

• Increased hardware complexity

• Poor resource utilization

AMP task allocation image


Bus and Memory Architecture


Alpha cores scaled to 0.10 um.

EV8 is 80 times bigger but provides only two to three times more single-threaded performance


Equal-area heterogeneous architectures with multithreaded cores


Exploring the potential from heterogeneity


MP-SoC Design Automation


Optimization and Synthesis

• Computation Synthesis:
– Task Allocation
– Task Scheduling

• Communication Synthesis:
– Interconnection Synthesis
– Buffer Sizing

Energy-Aware Task Mapping

• Goal: minimize energy consumption, given a CTG and a heterogeneous NoC

• Find:
– A mapping function M : tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned

• Solution formulated as a quadratic assignment problem and solved using branch and bound

• Communication-optimal task mapping
– Minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification
– Given a multiprocessor network, find a mapping of the application that satisfies the timing constraints

• Genetic algorithm (chromosome, generation, crossover, mutation)

Addressed by Hu et al., 2002.
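The mapping problem on this slide can be illustrated with a small branch-and-bound search. This is a sketch under simplifying assumptions, not the formulation of Hu et al.: the tiles are treated as identical, one task is placed per tile, and communication energy is approximated as traffic volume times Manhattan hop count. The names `map_tasks`, `edges`, and `tiles` are hypothetical.

```python
def manhattan(a, b):
    """Hop distance between two tiles of a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def map_tasks(num_tasks, edges, tiles):
    """Branch-and-bound search for a one-to-one task-to-tile mapping.

    edges: {(src_task, dst_task): volume}; tiles: list of (x, y).
    Cost = sum(volume * hops), a stand-in for communication energy.
    """
    best = {"cost": float("inf"), "mapping": None}

    def partial_cost(mapping):
        # Cost of every edge whose two endpoints are already placed.
        return sum(v * manhattan(mapping[s], mapping[d])
                   for (s, d), v in edges.items()
                   if s in mapping and d in mapping)

    def search(task, mapping, used):
        cost = partial_cost(mapping)
        if cost >= best["cost"]:
            return  # bound: this branch cannot beat the best mapping so far
        if task == num_tasks:
            best["cost"], best["mapping"] = cost, dict(mapping)
            return
        for tile in tiles:  # branch: try every free tile for this task
            if tile not in used:
                mapping[task] = tile
                used.add(tile)
                search(task + 1, mapping, used)
                used.remove(tile)
                del mapping[task]

    search(0, {}, set())
    return best["mapping"], best["cost"]


# Four tasks communicating in a ring, mapped onto a 2x2 mesh.
edges = {(0, 1): 10, (1, 2): 5, (2, 3): 10, (0, 3): 1}
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
mapping, cost = map_tasks(4, edges, tiles)
print(mapping, cost)
```

The bound is valid because edge costs are non-negative, so the cost of a partial mapping never decreases as more tasks are placed. A heterogeneous NoC would additionally need per-PE computation energy in the cost function.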


Interconnection Synthesis

– With each new technology:
  – Gate delay decreases ~25%
  – Wire delay increases ~100%

– Cross-chip communication increases

– Clock needs multiple cycles to cover die

Source: SIA NTRS Projection

Interconnect Delays & Density

Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology


Buffer Sizing

• Architectures have bounded buffer resources.

• If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.

• Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.
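The first half of this trade-off (deeper buffers, less waiting) can be illustrated with a toy cycle-based model. It is only a sketch: the traffic pattern (a burst of 4 items every 8 cycles, a consumer that drains one item every second cycle) and the function name `producer_stalls` are assumptions chosen for illustration.

```python
def producer_stalls(depth, burst=4, period=8, drain_every=2, cycles=16):
    """Count cycles in which the producer has data but the FIFO is full."""
    fifo = pending = stalls = 0
    for cycle in range(cycles):
        if cycle % period == 0:
            pending += burst              # a new burst becomes ready to send
        if cycle % drain_every == drain_every - 1 and fifo > 0:
            fifo -= 1                     # consumer removes one item
        if pending > 0:
            if fifo < depth:
                fifo += 1                 # producer pushes one item
                pending -= 1
            else:
                stalls += 1               # FIFO full: producer waits
    return stalls


for depth in (1, 2, 4):
    print(depth, producer_stalls(depth))
```

In this model the stall count stops improving once the buffer is deep enough to absorb a burst, which is where the costs listed in the last bullet start to dominate.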


Multiple Clocks due to Interconnect limitation


MPSoC HW platform perspective

• Today's platforms are quite heterogeneous
– Reasons: efficiency and legacy IP

• Homogeneous MPSoC would scale well and would simplify programming
– Works well for desktop PCs
– Too inefficient for embedded apps

• Mixed MPSoC as a compromise?
– Globally homogeneous, locally heterogeneous
– (Re)configurable PEs

www.iss.rwth-aachen.de

Future MPSoC programming

• Sequential-to-parallel code generation: C code (and platform/RTOS model) in, parallel C codes out
» Step 1: exhibit parallelism at block/task level to the user for manual mapping
» Step 2: automate code partitioning/mapping

• Massive use of compiler technology, e.g. data flow analysis

• Use of "platform refinement" technology as backend for machine code generation and simulation

www.iss.rwth-aachen.de

The von Neumann inheritance

• Sequential programming of sequential machines
– Pascal, Modula-2, C, C++, Java, ...

• Sequential programming of parallel machines?
– VLIW: handled by sophisticated compilers
– SIMD: will be accomplished by compilers
– Does not scale to heterogeneous MPSoC with distributed control paths!

• Parallel programming of parallel machines!
– "We need to move ... to parallel thinking and programming... We are standing at the very beginning... It's a huge area." (J. Gutknecht, ETH Zurich)

• What to do in the meantime?

© 2007 R. Leupers

Block clustering approach

www.iss.rwth-aachen.de


Block clustering approach (cont'd)

www.iss.rwth-aachen.de

Block clustering approach (cont'd)

www.iss.rwth-aachen.de

Multicore SoC design methodology

Network On Chip


Technology Evolution


What are NoC’s?

• According to Wikipedia:

– “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”


Network-on-Chip (NoC)

• Communication is achieved by connecting switches together to form a network topology.

• Scalability: offers much greater scalability.

• Parallelism: multiple components can send data simultaneously.

• Energy efficiency: point-to-point connections require less energy than a bus.

• Global synchronization is no longer needed.

NoC Design Considerations (I)

• There are several popular topologies:
– 2D mesh (most popular)
– Torus (rings)
– Tree (fat-tree, butterfly fat-tree)

• The on-chip interconnection network will soon be a limiting factor for performance and energy consumption:
– It has been reported to account for over 50% of the total energy requirement!

• The interconnect should consume the fewest resources possible and should be:
– area efficient: switches should be as small (simple) as possible
– energy efficient: related to area efficiency
– fast: simple routing algorithms should be used
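As an illustration of a "simple routing algorithm", the sketch below uses dimension-ordered (XY) routing on a 2D mesh and compares hop counts between a mesh and a torus. XY routing is chosen here as a common example; the slides do not name a specific algorithm, and the function names are hypothetical.

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2D mesh: travel along X, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def mesh_hops(src, dst):
    """Minimum hop count on a mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, n):
    """Minimum hop count on an n x n torus, where each row/column wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, n - dx) + min(dy, n - dy)


print(xy_route((0, 0), (2, 1)))
print(mesh_hops((0, 0), (3, 3)), torus_hops((0, 0), (3, 3), 4))
```

Each router only compares its own coordinates with the destination, so the routing logic stays small. The wrap-around links of the torus shorten the corner-to-corner path at the cost of longer wires.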


NoC exemplified

[Figure: processors (masters), a global memory (slave), and global I/O blocks (slaves) connected through a network of routing nodes]

NoC: Good news

☺ Only point-to-point one-way wires are used, for all network sizes.

☺ Aggregated bandwidth scales with the network size.

☺ Routing decisions are distributed and the same router is re-instantiated, for all network sizes.

☺ NoCs increase wire utilization (as opposed to ad-hoc p2p wires).

Sergio Tota and Mario R. Casu

There’s no free lunch…

• Internal network contention causes (often unpredictable) latency.
• The network has a significant silicon area.
• Bus-oriented IPs need smart wrappers.
• Software needs clean synchronization in multiprocessor systems.
• System designers need reeducation for new concepts.

Sergio Tota and Mario R. Casu

Facts about NoC’s

• It is a way to decouple computation from communication

• The design is layered (physical, network, application…): Taming complexity is made easier

• Communication between processing elements in NoC takes place by encapsulating data in packets

• The elementary packet piece to which switch and routing operations apply is the flit
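The packet/flit relationship in the last two bullets can be sketched as follows. The flit format used here (a head flit carrying the destination, then body flits, then a tail flit) and the 4-byte flit size are illustrative assumptions, not the format of any particular NoC.

```python
def packetize(dest, payload, flit_bytes=4):
    """Split a payload into flits: one head flit, then body flits, then a tail flit."""
    flits = [("HEAD", dest)]  # the head flit carries the routing information
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    for i, chunk in enumerate(chunks):
        kind = "TAIL" if i == len(chunks) - 1 else "BODY"
        flits.append((kind, chunk))
    return flits


def reassemble(flits):
    """Rebuild (destination, payload) at the receiving processing element."""
    kind, dest = flits[0]
    assert kind == "HEAD"
    return dest, b"".join(data for _, data in flits[1:])


flits = packetize((1, 2), b"abcdefghij")
print(flits)
print(reassemble(flits))
```

Switches only need to inspect the head flit to make a routing decision; the remaining flits follow the same path, which is what makes the flit the unit of switching.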


Topologies

• Heritage of networks with new constraints
– Need to accommodate interconnects in a 2D layout
– Cannot route long wires (clock frequency bound)

a) SPIN, b) CLICHÉ, c) Torus, d) Folded torus, e) Octagon, f) BFT

SPIN (Guerrier et al., DATE ’00/’03)

• Wormhole switching, adaptive routing, and credit-based flow control.
• It is based on a fat-tree topology.
• A flit is only one word (36 bits, of which 4 bits are for packet framing).
• The input buffers have a depth of 4 words.

Dally et al., DAC’01

• 2D folded torus topology
• Wormhole routing and Virtual Channels (VC)

Kumar et al., ISLVLSI’02

• Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ)
• 2D mesh topology
• Message passing

Pande et al., TCOMP’05

• Butterfly Fat Tree
• Wormhole, virtual channels
• Header flits: 3 ck cycles latency (input arbitration, routing, output arbitration)
• "Body" flits: 3 ck cycles (input arbitration, switch traversal, output arbitration)

Goossens et al., IEE CDT’03

• Supports both VCT (virtual cut-through) and WH (wormhole) switching, GT (guaranteed throughput) and BE (best effort) traffic, and IQ (input queuing) and VOQ (virtual output queuing)

• GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from input "j" to output "k" in a store-and-forward (S&F) fashion.

• BE uses matrix scheduling

• GT connections are set up by special BE system packets

• Prototype with WH and IQ
– 5 ports
– 0.13 um, 0.26 mm2, 500/166 MHz
– Flit size = 3 words, each 32 bits
– 80 Gb/s aggregate bandwidth
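The TDM idea behind GT traffic can be sketched as a slot table in which each time slot holds at most one reservation per router input and per router output. The class below is an illustrative model, not the router's actual data structure; `SlotTable` and its methods are hypothetical names. The last function shows one reading of the prototype figures that reproduces the quoted aggregate bandwidth (ports × word width × clock), which is an assumption about how that number was derived.

```python
class SlotTable:
    """TDM slot table: per slot, each output and each input is used at most once."""

    def __init__(self, num_slots):
        self.slots = [dict() for _ in range(num_slots)]  # slot -> {output: input}

    def reserve(self, slot, inp, out):
        table = self.slots[slot % len(self.slots)]
        if out in table or inp in table.values():
            return False  # would contend with an existing virtual circuit
        table[out] = inp
        return True


def aggregate_bandwidth_gbps(ports, word_bits, clock_mhz):
    """Peak aggregate bandwidth if every port moves one word per clock cycle."""
    return ports * word_bits * clock_mhz / 1000


table = SlotTable(num_slots=4)
print(table.reserve(0, "in0", "out2"))   # first reservation succeeds
print(table.reserve(0, "in1", "out2"))   # same output, same slot: rejected
print(table.reserve(1, "in1", "out2"))   # a different slot is free
print(aggregate_bandwidth_gbps(ports=5, word_bits=32, clock_mhz=500))
```

Because conflicts are resolved when a connection is reserved rather than when data arrives, GT flits never wait on arbitration, which is what gives the latency and throughput guarantees.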


SKKU’s Mobile MP-SoC Platform

Koonshik Cho & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.

1. Multiprocessor SoC Design Platform

• SW: respond actively to performance improvements and changes in standards

• HW: modular, flexible, and scalable architecture

• Platform based design

2. Multiprocessor SoC Platform test

• DVB-T Receiver

3. Tools

• Seamless CVE (Mentor Graphics)

• ADS(ARM)

SoC (DSP+ARM) Platform


SoC (DSP+ARM) Platform


Extended multi-processor platform


ARM Platform


AMBA BUS (1)

The AMBA bus contains a multiplexer, an arbiter, and a decoder, and mediates between multiple masters and slaves.

AMBA BUS (2)

AMBA Bus (Master to slave multiplexer)

• A bus master is a device that initiates read and write operations by driving address and control signals to a slave. Only one master can transfer at any one time, although multiple masters can be attached to the bus.

AMBA BUS (3)

AMBA Bus (Slave to master multiplexer)

• A bus slave is a device that responds to a master's reads and writes within its assigned address space. The slave reports its status to the master through the ready and response signals. Multiple slaves can be attached to the bus.

AMBA BUS (4)

• AHB Arbiter: The bus arbiter ensures that only one master is selected at a time. It arbitrates according to its own priority scheme, and an AHB has exactly one arbiter.

• AHB Decoder: The AHB decoder selects the appropriate slave from the upper bits of the address driven by the master. An AHB likewise has a single decoder.

• APB Bridge: The only bus master on the APB (Advanced Peripheral Bus). The APB bridge is a slave on the system bus (AHB or ASB); when the decoder selects the APB, the bridge acts as the master on the APB. As a slave module, the APB bridge handles the bus handshake and signal retiming on behalf of the local peripheral bus.
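The roles of the decoder and arbiter can be sketched behaviorally. This is a simplified model, not the AMBA protocol: the number of address bits used for slave selection and the fixed-priority rule (lowest master index wins) are assumptions, since the slide only says the arbiter uses its own priority scheme.

```python
def decode(addr, select_bits=4, addr_bits=32):
    """Decoder: the upper `select_bits` of the address pick the slave."""
    return addr >> (addr_bits - select_bits)


def arbitrate(requests):
    """Arbiter: grant exactly one requesting master, or None if nobody requests.

    requests[i] is True when master i requests the bus.
    Assumed scheme: fixed priority, lowest index first.
    """
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


print(decode(0x80001234))              # slave index taken from address bits 31..28
print(arbitrate([False, True, True]))  # master 1 wins over master 2
```

Having a single arbiter and a single decoder means there is one place that decides who drives the bus and one place that decides who responds, which keeps the shared bus free of conflicts.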


AMBA BUS (5)

• Interrupt controller: Receives interrupt request signals from up to 32 interrupt sources and generates the nIRQ or nFIQ signal applied to the ARM9 processor. Of the 32 sources, sources 0 to 3 generate nFIQ and sources 4 to 31 generate nIRQ. Lower-numbered sources have higher priority.

• Timer: The timer module provides three timers. Each timer is a 16-bit counter that supports three prescale values (16, 256, and 4096). It decrements the counter by 1 every period and raises an interrupt when the count reaches 0. The interrupt request is held until the ARM9 processor acknowledges it through the timer interrupt clear register.
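The behavior described on this slide can be captured in a short behavioral model. It is a sketch under stated assumptions: the timer is modeled as one-shot (the slide does not say whether it reloads), one counter decrement happens every `prescale` input clocks, and all names are hypothetical.

```python
def interrupt_outputs(pending):
    """Return (fiq_source, irq_source) for a 32-bit pending mask.

    Sources 0-3 drive nFIQ and sources 4-31 drive nIRQ; within each group
    the lowest-numbered pending source has the highest priority.
    None means the corresponding line is not asserted.
    """
    fiq = next((i for i in range(0, 4) if (pending >> i) & 1), None)
    irq = next((i for i in range(4, 32) if (pending >> i) & 1), None)
    return fiq, irq


class Timer:
    """16-bit down counter with a prescaler; the interrupt is held until cleared."""

    def __init__(self, load, prescale):
        assert prescale in (16, 256, 4096) and 0 < load <= 0xFFFF
        self.count, self.prescale = load, prescale
        self.ticks = 0
        self.irq = False

    def clock(self):
        self.ticks += 1
        if self.ticks == self.prescale:     # one counter period has elapsed
            self.ticks = 0
            if self.count > 0:
                self.count -= 1
                if self.count == 0:
                    self.irq = True         # stays set until clear() is called

    def clear(self):
        """Models a write to the timer interrupt clear register."""
        self.irq = False


print(interrupt_outputs((1 << 2) | (1 << 5) | (1 << 8)))
timer = Timer(load=2, prescale=16)
for _ in range(32):
    timer.clock()
print(timer.irq)
```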


Teak DSP Platform

• Structure of the Teak DSP platform, which serves as the co-processor in the overall platform

Configuration of crossbar switch

• Communication interface architecture (crossbar structure)

Reconfigurable crossbar switch

Uses the VHDL generate statement

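Reconfigurable crossbar switch (VHDL code)

Entity of the CI module (ci_top.vhd):

    -- The crossbar dimensions are generics, so the switch can be resized per platform.
    entity CI_TOP is
      generic ( number_of_masters, number_of_slaves : integer );
      port ( … omitted );
    end CI_TOP;

Use of the CI module (multiplatform.vhd):

    -- Instantiation for a platform with 4 masters and 6 slaves.
    COMMUNICATION_INTERFACE : CI_TOP
      generic map ( number_of_masters => 4, number_of_slaves => 6 )
      port map ( … omitted );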

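Communication Interface: Mux vs Crossbar

Mux
Advantages:
‣ Relatively easy to implement
‣ Effective when there are few masters and slaves
Disadvantages:
‣ Parallel processing between processors is difficult
‣ Becomes a bottleneck as the system expands

Crossbar
Advantages:
‣ Effective parallel processing between processors
‣ Delay stays the same as the system expands
Disadvantages:
‣ Difficult to implement
‣ Relatively unfavorable in size and power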

Interconnection network

Omega interconnection Octagon interconnection

Mesh interconnection


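Interconnection network: advantages and disadvantages

Shared bus
Advantages:
‣ Relatively easy to implement
‣ Effective when there are few masters and slaves
‣ Delay stays the same as the system expands
‣ Implementation complexity: low
Disadvantages:
‣ Parallel processing between processors is difficult
‣ Low bus efficiency
‣ High power consumption (broadcasting)

Crossbar
Advantages:
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: average
Disadvantages:
‣ Implementation complexity (routing and scheduling): average
‣ Size and wiring grow as the system expands

Omega network
Advantages:
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: average
‣ Data path: excellent (short)
Disadvantages:
‣ Implementation complexity (routing and scheduling): high
‣ Somewhat limited scalability

Octagon
Advantages:
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: average
‣ Data path: excellent (shortest)
Disadvantages:
‣ Implementation complexity (routing and scheduling): high
‣ Somewhat limited scalability (the number of masters and slaves is limited to 8)

Mesh
Advantages:
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: average
Disadvantages:
‣ Implementation complexity (routing and scheduling): very high
Note:
‣ Suited to medium and large systems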

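Network router cell

– The multiprocessor platform has a 4-master, 6-slave structure

– The CI cell is designed as a structure of 24 2-by-2 muxes

– CI controller => Req, Grant, mux control, etc.

– With Seamless CVE and ModelSim running in co-simulation, the ARM926 and the Teak DSP access the slaves at the same time and each reads and writes its own data. [Figure: platform function block]

– When a master accesses a slave (IP), the CI controller handles the request signal, the grant signal, and the control signal for each mux, and performs round-robin arbitration and address decoding. [Figure: CI controller inner block]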
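The round-robin function of the CI controller can be sketched as one arbiter per slave, so that masters targeting different slaves proceed in parallel while masters targeting the same slave take turns. This is a behavioral illustration with hypothetical names, not a translation of the actual controller logic.

```python
class RoundRobinArbiter:
    """Grants one requester per cycle, starting after the last granted master."""

    def __init__(self, num_masters):
        self.n = num_masters
        self.last = num_masters - 1  # so that master 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.n + 1):
            master = (self.last + offset) % self.n
            if requests[master]:
                self.last = master
                return master
        return None


def crossbar_cycle(arbiters, wants):
    """wants[m] is the slave index requested by master m, or None.

    Returns {slave: granted_master} for one arbitration cycle.
    """
    grants = {}
    for slave, arbiter in enumerate(arbiters):
        winner = arbiter.grant([w == slave for w in wants])
        if winner is not None:
            grants[slave] = winner
    return grants


# 4 masters and 6 slaves, as in the platform described above.
arbiters = [RoundRobinArbiter(4) for _ in range(6)]
wants = [2, 2, 5, None]                  # masters 0 and 1 both want slave 2
print(crossbar_cycle(arbiters, wants))   # first cycle
print(crossbar_cycle(arbiters, wants))   # second cycle: the other master gets slave 2
```

Keeping a separate arbiter state per slave is what lets the crossbar serve the access to slave 5 in the same cycle as the contended access to slave 2.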


CI-controller State Diagram


CI controller simulation waveform


DVB-T Baseband Receiver


Hardware-software co-design flows


A shared memory structure and hardware-software partitioning


Frequency offset compensator hardware


Fine and Coarse Frequency Synchronizer (Beek & Classen)


FFT block diagram


Equalizer hardware block diagrams


DVB-T baseband Receiver Scheduling


DVB-T baseband Receiver Scheduling


Performance evaluation

Processing type / functional block      SW        SW & HW (Teaks + ARM + HW IP)   HW (IP) only   MAL
Frequency compensator & remove guard    -         182.5 us                        13.8 us        10.5 us
Fine freq. sync. (Beek)                 -         56.3 us                         1.5 us         7.8 us
Symbol timing recovery                  144 us    -                               -              5.2 us
FFT                                     -         188.9 us                        38.6 us        13.6 us
Coarse freq. sync. (Classen)            -         241 us                          3.3 us         11 us
Scattered pilot detection               46.5 us   -                               -              3.3 us
Equalizer                               -         219.5 us                        11.2 us        9.5 us
De-mapping                              19.9 us   -                               -              4.9 us

Task Chart of Multi-processor platform for DVB-T baseband receiver


Task Chart of Multi-processor platform for DVB-T baseband receiver


Modeling of Motion Compensation IP using SCML

Le Minh Nghia & Jun Dong Cho
Mobile SoC Design Automation Lab., Sungkyunkwan Univ.

Introduction of SystemC Modeling in CoWare

• TLM Peripheral Modeling with CoWare
• SystemC Modeling Library (SCML)
• Motion Compensation Modeling using SCML

TLM Peripheral Modeling with CoWare

• Four use cases for Transaction-Level Modeling (TLM)
– Functional View (FV)
– Architect's View (AV)
– Programmer's View (PV)
– Verification View (VV)

TLM Peripheral Modeling with CoWare

• General pattern for modeling a peripheral component
– Separate behavior, communication, and timing
– Initiators and targets are created by the user, depending on the platform
– A bus transactor converts generic communication into a bus-specific TLM interface
– The accuracy of the timing depends on the use case

TLM Peripheral Modeling with CoWare

• Communication
– Communication through function calls
– Simulation speed strongly depends on the bus model
– The PV bus model used for software development can be simulated very fast

• Behavior
– Functionality
– Synchronization
– Storage

• Timing
– The timing model is based on the clock object in the SystemC Modeling Library (SCML)

TLM Peripheral Modeling with CoWare

• Modeling the Target pattern
– Communication: bus transactor
– Storage and synchronization: register bank as the interface
– Behavior: a collection of call-back functions, each corresponding to a bitfield or register in the register bank
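The Target pattern (a register bank whose writes trigger call-back functions) can be mimicked in plain Python. This is only an analogy to make the structure concrete: it does not use or reproduce the SCML API, and `RegisterBank`, `register_write_callback`, and the register index are hypothetical.

```python
class RegisterBank:
    """Storage plus per-register write call-backs, in the spirit of the Target pattern."""

    def __init__(self, size):
        self.regs = [0] * size
        self.on_write = {}  # register index -> call-back function

    def register_write_callback(self, index, callback):
        self.on_write[index] = callback

    def write(self, index, value):
        # A bus write stores the value, then runs the attached behavior (if any).
        self.regs[index] = value
        callback = self.on_write.get(index)
        if callback is not None:
            callback(value)

    def read(self, index):
        return self.regs[index]


START_REG = 0  # hypothetical register index
log = []

bank = RegisterBank(size=4)
bank.register_write_callback(START_REG, lambda v: log.append(f"start={v}"))
bank.write(START_REG, 1)   # runs the call-back
bank.write(1, 0xAB)        # plain storage, no behavior attached
print(log, bank.read(1))
```

The master processor only sees memory-mapped registers, while the peripheral's behavior is attached to individual registers, which keeps communication and behavior separate as the general pattern requires.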


TLM Peripheral Modeling with CoWare

• Modeling the Initiator pattern
– Communication: the bus transactor converts the posted transactions in the queue into real bus transactions
– Storage and synchronization: includes a post port and an initiator storage element, scml_array (in SCML). The post port posts transactions in a non-blocking manner. The actual synchronization depends on the data and space available in the storage element, i.e. the scml_array object
– Behavior: modeled by autonomous SystemC processes

Initiator Synchronization

• Two classes of initiator blocks:
– Free-running initiator: the transfers it initiates do not need any access from another peripheral
– Slaved initiator: the block has a target port, and transfers are only initiated after that port is accessed

• Three synchronization patterns for initiator blocks:
– Free-running initiator
– Fully slaved initiator
– Semi-free-running initiator

Initiator Synchronization

• Modeling a free-running initiator peripheral
– The thread is modeled by an SC_THREAD and posts transactions
– wait(sc_time): schedules the next execution of the thread

Initiator Synchronization

• Modeling a fully slaved initiator peripheral
– A slaved initiator only sends a transaction when its target port is accessed
– The loop in a fully slaved initiator returns control to the master thread after it has posted the transaction

Initiator Synchronization

• Modeling a semi-free-running slaved initiator peripheral
– The thread containing the loop is triggered by a start event
– The start event is generated by an access to the target port of the initiator
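The semi-free-running pattern can be illustrated with a Python thread and event standing in for SC_THREAD and sc_event. This is an analogy only: it is not SystemC code, the names are hypothetical, and posting a transaction is reduced to appending to a list.

```python
import threading


class SemiFreeRunningInitiator:
    """Idle until the target port is written, then posts a burst on its own."""

    def __init__(self, burst):
        self.burst = burst
        self.start = threading.Event()   # stands in for the start sc_event
        self.posted = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        self.start.wait()                # blocked until the start event fires
        for i in range(self.burst):      # from here on the loop runs freely
            self.posted.append(i)        # stands in for posting a transaction

    def write_target_port(self):
        """An access to the target port generates the start event."""
        self.start.set()


initiator = SemiFreeRunningInitiator(burst=3)
before = list(initiator.posted)          # nothing is posted before the trigger
initiator.write_target_port()
initiator.thread.join()
print(before, initiator.posted)
```

A free-running initiator would skip the initial wait, and a fully slaved one would post a single transaction per target access and then return control to the master; the semi-free-running case sits between the two.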


SystemC Modeling Library (SCML)

• Memory and bitfield objects
– Model bitfields and memory-mapped registers
– Memory objects support posting non-blocking transactions
– Support synchronization through blocking read and write accesses

• Clock object
– Models timing or clocks in IPs

• Initiator-side object
– Models the communication of initiator peripherals to support reuse

Modeling TLM Motion Compensation

• Outline of the Motion Compensation IP
– Synchronization: semi-free-running slaved initiator
– Behavior: algorithm extracted from the J.M source code
– The structure has two parts:
  • Target part: interfaces with the master processor through a register bank and is modeled following the SCML Target pattern
  • Initiator part: models the posting of transactions and the synchronization of their transmission, following the SCML Initiator pattern

Modeling TLM Motion Compensation


Modeling TLM Motion Compensation

• Three ports:
– pConfig: interface with the master processor to receive parameters
– p_Irq: generates an interrupt to synchronize with the master processor
– p_Post: posts transactions to a specific bus through the bus transactor

• Register bank: holds the parameters of the Motion Compensation block

• StartStopReg and IrqReg: interface with the master processor

• Behavior block: the Motion Compensation algorithm and transaction modeling

• Call-back functions: handle the events caused by writing StartStopReg and IrqReg

Modeling TLM Motion Compensation

• Functions in the TLM Motion Compensation model:
– f_initialize(): initializes the model parameters
– f_thread(): waits for the event generated by a write to StartStopReg
– f_write_start_stop(): call-back function for writes to StartStopReg. It activates or deactivates the model by generating an sc_event to signal f_thread()
– f_clear_irq(): clears IrqReg
– f_MotionCompensation(): Motion Compensation behavior based on the original source code in the J.M reference software
– f_do_post(): posts the transactions held in storage (the transaction pool) to the bus transactor and manages posting synchronization
– f_postTransfer(): posts a single transaction to the bus transactor
– f_release_trans(): releases the transaction pool

Next…

• Extract parameters as a TestVector from the J.M source code

• Build a platform in CoWare

• Test the Motion Compensation IP with the TestVector

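Conclusion

• As the complexity and cost of (mobile) SoCs increase, a design process based on an MP-SoC platform becomes important.

• The challenges for mobile platforms are low power, verification that includes the RF interface, a variety of standards, and platform optimization.

• It is desirable to develop a platform that draws on the strengths and weaknesses of the various platforms and methodologies.

• Engineers (system architects) who understand and can design across HW, SW, and algorithms need to be trained.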