Multi-processor System on Chip...

Mobile SoC Design Automation Lab. Sungkyunkwan Univ. S. Korea

Multi-processor System on Chip Design

Jun-Dong Cho, Sungkyunkwan University

Contents

• Requirements for next-generation SoC (System on Chip)
• The need for MP-SoC
• History of Multiprocessors
• MP-SoC Examples and Applications
• Homogeneity and Heterogeneity
• MP-SoC Design Automation
• Network on Chip
• SKKU’s Mobile MP-SoC Platform

Requirements for Next-Generation SoC (System on Chip)

Processor: AP - MC

Modem: GSM/GPRS - WCDMA - CDMA2000

Connectivity: Wireless LAN - GPS - Bluetooth

RF/Analog: Rx - Tx - Zero IF - PM

Camera Chipset: CIS - CCD - ISP

Display Driver IC (DDI): STN - TFT - OLED

Smart Card: SIM

Flash Memory: Code/Data Storage

SIP / MCP

RAM: Mobile DRAM - SRAM - UtRAM

What is System on Chip?

SoC

The Need for High Performance and Low Power

3D graphics

Moore’s law

Shannon’s law (2.8x / 18m)

2G (IS-95): 9.6 kbps

3G (CDMA 1xEV): 3,100 kbps

4G (100 Mbps ~ 1 Gbps)

Timeline: 1995, 2003, 2012

Battery capacity

QVGA

HD (720p)

Full HD (1080i)

Mobile Multimedia

Design Complexity

Productivity Gap: Design complexity vs. Moore’s law Power Gap: Design complexity vs. Battery

Flexibility-Energy Gap (axes: flexibility vs. energy efficiency)

• Embedded processor (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• FPFA: 10-80 MOPS/mW
• Signal-processing ASIC: 200 MOPS/mW

FPFA: Field Programmable Function Array

Sensor network design space

Wireless embedded systems design space

Five Requirements for Improving the Productivity of Next-Generation SoC Design

1. High Performance
2. Fast Verification
3. Small Form Factor
4. Low Power Solutions
5. Design-Technology Integration for Manufacturability

1. High-Performance: CMP + NoC

Heterogeneous Chip Multi-processor Architecture

2004 2007 2010 2013 2016

#. PEs

Source: ITRS 2005 draft

Technology Evolution

2. Fast Verification: Embedded System Level

Complexity

Moore’s Law: 2x / 18m

Nielsen’s Law: 2x / 12m

Embedded SW: 2x / 10m
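The doubling periods above can be compared with a little arithmetic. The sketch below is only an illustration: the 5-year horizon and the function name `growth_factor` are assumptions, not part of the slides.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Growth multiple after `years` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling periods as quoted on the slide (in months).
DOUBLING_MONTHS = {
    "Moore's Law": 18,
    "Nielsen's Law": 12,
    "Embedded SW": 10,
}

if __name__ == "__main__":
    horizon_years = 5  # assumed horizon, chosen only for illustration
    for name, months in DOUBLING_MONTHS.items():
        factor = growth_factor(horizon_years, months)
        print(f"{name}: x{factor:.1f} after {horizon_years} years")
```

Under these figures, embedded software complexity grows 64x in five years while transistor count grows roughly 10x, which is one way to read the verification gap this slide points at.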

System specification

Architecture design

RTL design

UML / Java / MatLab

SystemC / ADL

Verilog / VHDL

TLM: ctrl1 / cmd1 / ack1 / ack0

[Figure: discrete Mobile AP + 32MB NAND + 16MB SDRAM + 16MB SDRAM (~35 mm) versus SiP with Flash, SDRAM, and Mobile AP (17 mm)]

• EMI Reduction
• 60% Smaller Area
▷ 8 layers of MCP
▷ Cost reduction by 15%

3. Small Form Factor

SiP: Mobile Application Processor + Mobile Memory

• MTCMOS

• Clock Gating

• Multi-Vdd

• Tr Sizing
• VTCMOS
• Multi-Vt
• SOI

• High-κ Metal Gate

Levels: Device, Circuit, Architecture, Runtime
• Parallelization
• GALS

DAC 2004

4. Low Power Solutions

1.2V, 350MHz

1.5V, 500MHz

1.0V, 200MHz

Multi-Vdd

•DPM/DVS

Active

Standby

VTCMOS

• MTCMOS
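To see why Multi-Vdd and DVS pay off, the three operating points listed on this slide can be compared with the usual first-order model of dynamic power, P ∝ C·Vdd²·f. The sketch below assumes constant switched capacitance and activity and ignores leakage; the function name is hypothetical.

```python
def relative_dynamic_power(vdd: float, f_mhz: float,
                           ref_vdd: float = 1.5, ref_f_mhz: float = 500.0) -> float:
    """Dynamic power relative to a reference point, using P ~ C * Vdd^2 * f.

    Assumes the switched capacitance C and the activity factor are the same
    at every operating point; leakage power is not modeled.
    """
    return (vdd ** 2 * f_mhz) / (ref_vdd ** 2 * ref_f_mhz)


if __name__ == "__main__":
    # Operating points quoted on the slide: (Vdd in volts, frequency in MHz).
    for vdd, f_mhz in [(1.5, 500.0), (1.2, 350.0), (1.0, 200.0)]:
        ratio = relative_dynamic_power(vdd, f_mhz)
        print(f"{vdd:.1f} V, {f_mhz:.0f} MHz -> {ratio:.3f} of reference power")
```

Under this model the 1.2 V / 350 MHz point runs at 70% of the reference frequency for about 45% of its dynamic power, and the 1.0 V / 200 MHz point at 40% of the frequency for about 18% of the power.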

Statistical Analysis

Critical timing, power

Designer’s intention

5. Design-Technology Integration for Manufacturability (DfM)

Variation Information

NA, Tox

Latency, Power

Fault Probability

Vdd, Temp

Vt, Lg, L, t, tILD

Quantum Physics

Mask / Process Design

Architecture Design

Logic / Physical Design

Algorithm Design

Fault-tolerant algorithm

Yield-improving architecture

Statistical STA

More SoC topics …

• Platform optimization
– Power management
– BW allocation
– Resource sharing
– Task distribution
– Efficient communications
• Low Power
• Verification
• Training talent (system architects)

The Need for MP-SoC (Multi-Processor System on Chip)

Definition of MP-SOC?

Usually Heterogeneous Multiprocessor:

• CPUs, DSPs, etc.
• Hardwired accelerators.
• Mixed-signal front end.

Definition of Multiprocessor by Enslow Jr.

MIMD machines with shared memory
• Shared memory
• Shared I/O
• Distributed OS
• Homogeneous
Extended definition: All parallel machines (wrong usage)

Future Microprocessors

Why MP?

• Uniprocessors have hit the ceiling
• Get performance from better architecture instead of more MHz

Anatomy of a Cellular Phone

3G Wireless Protocols

MP-SoC application areas: 4G: Multiple standards

• Communications.
• Networking.
• Multimedia.
• Security.

Digital RF supporting multiband/multimode

Evolution of the MP-SoC platform (example: WCDMA + CDMA2000)

System Architecture for 3G

• 4 PEs
– static kernel mapping and scheduling
– SIMD + scalar units
• 1 ARM GPP controller
– scalar algorithms and protocol controls

ARM MPCore™ Architecture

Reconfigurable and Scalable MP-SoC Platform

Road Map to MP-SoC Trends

• Mask NRE: over $1M
• Design NRE: $10M to $75M
– ASICs replaced by programmable ASSPs, FPGAs
• Number of embedded processors
– DVD/STB/HDTV, mobile phones: 5 to 8
– Image processing, networking, base station: 8 to 100+
• E-S/W complexity
– Set-top box, audio: >1 million lines of code
– E-S/W becoming an essential part of SoCs

Who’s Law?

Why is MP-SOC Challenging?

Software Defined Wireless Multimedia Terminals

• Lower costs
– Platform longevity, higher volume
– SW has lower development costs
• Time to market
– Future protocols will have complex implementations
– Overlap testing/development cycles
• Adaptability
– Standards change over time
– Multi-mode operation
– Sharing hardware resources

Multistandard Radio

• UMTS
• GSM/GPRS/EDGE
• WLAN
• Bluetooth
• UWB

Multistandard M/M
• H.264
• MP3
• AAC
• GPS
• DVB-H
• TPEG

SDR = Reconfigurable Radios

SDR Configuration (Soft Radio Digital Signal Processing Engine)
• Modulation Format
– QPSK
– DQPSK
– π/4 DQPSK
– {16,64,256,1024} QAM
– OFDM
– OFDM CDMA
• Digital Down/Up Conversion (DDC)
– Channel Center
– Decimation/Interpolation rates
– Compensation Filters
– Matched Filter α = {0.25,0.35,...}
• FEC
– Convolutional
– Reed-Solomon
– Concatenated Coding
– Turbo CC/PC
– (De-)Interleave
• Network Interface Definition
• Channel Access
– CDMA
– TDMA
• Security
• Beam Forming
• DSSS
– Rake, track, acquire
– Multi-User Detection (MUD)
– ICU

Future mobile applications?

• Mobile supercomputing
– Speech recognition.
– Cryptography.
– Augmented reality.
– Typical applications (email, etc.).

• Requires 16x 2 GHz Pentium 4 ?

Mudge et al:

Culture and Education? Personal Entertainment ?

Health, Human, Bio

MP-SoC Application Areas

CIS Mobile

Recorder

Health

HCI Bio

Data Broadcasting

Automotive & Robotics

Telematics, Unmanned Driving, Robot

MPSoC today

• High performance, low power: there is no other way than MPSoC!

• Virtually all processor vendors are on the MPSoC route
– TI: OMAP, DaVinci
– STM: Nomadik
– IBM: Cell
– Intel: IXP, CoreDuo
– Philips: Nexperia
– Atmel: Diopsis
– ARM: MPCore
– ARC: VRaptor

• Urgent need for MPSoC design tools
– Application design and platform capture
– Architecture exploration and optimization
– Simulation and verification
– Application to architecture mapping

The triangle, Chicken and Egg?

architectures

applications

methodologies

• Hardware and software architectures determine capabilities.
• Applications guide design decisions.
• Methodologies allow repeatable, predictable design.

Tape Out

Verify: Compose the system

Verify: Simulate

Verify: SoC Composer

Verify (timing, area): Synthesis + P&R

Verify: Simulate (performance)

Should the SoC designer work hard?

Requirements

Why is verification important in mobile SoC?

Why has our verification become weak?

Some statements from MPSoC 2006 Symposium

• “The ad-hoc approach to SoC design cannot scale with Moore’s Law... The SW development environment as afterthought era of IC design is rapidly drawing to a close” (K. Keutzer, UCB)

• “Power-constrained CPUs are mandatory, but the most exciting features require system-level SW optimization” (M. Kuulusa, Nokia)

• “Multi-core platforms are a reality – but where is the SW support?” (R. Lauwereins, IMEC)

Gartner Announces the Top 10 Technologies for 2007

• Gartner announced the 10 technologies expected to reach maturity over the next three years (25th Annual Gartner Data Center Conference, Nov. 28 - Dec. 1, 2006)

• Open Source, Virtualization, Information Access, Ubiquitous Computing, Grid Computing, Compute Utilities, Multicore Processors, Web 2.0, Network Convergence, Water Cooling

http://www.gartner.com/2_events/conferences/lsc25.jsp

Fundamental to Parallel Machines

Purposes of multiple processors

• Performance
– A job can be executed quickly with multiple processors
• Fault tolerance
– If a processing unit is damaged, the total system can remain available: redundant systems
• Resource sharing
– Multiple jobs share memory and/or I/O modules for cost-effective processing: distributed systems
• Low power
– High performance with low-frequency operation

Why Multi-Threaded Cores?

[Figure: processing elements (DSPs, H/W multithreaded RISC, H/W processing elements) with I$/D$ caches and SRAM]

• Increasing gap: memory & processor speeds (2x / 2 years)
• Increasing gap: interconnect & gate delays (multi-clock)
• More parallel processing (lower power, higher perf./mm²)

Flynn’s Classification

• The number of instruction streams: M (Multiple) / S (Single)
• The number of data streams: M / S
– SISD: uniprocessors (including superscalar, VLIW)
– MISD: not existing (analog computer)
– SIMD
– MIMD
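The classification above amounts to a two-way lookup on the number of instruction and data streams. A minimal sketch (the function name `flynn_class` is an assumption chosen for illustration):

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn class (SISD, SIMD, MISD, MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one instruction and one data stream")
    i = "S" if instruction_streams == 1 else "M"  # single or multiple instruction streams
    d = "S" if data_streams == 1 else "M"         # single or multiple data streams
    return f"{i}I{d}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))   # uniprocessor
    print(flynn_class(1, 16))  # one instruction stream driving 16 data lanes
    print(flynn_class(4, 4))   # four independent processors
```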

Processors

Memory modules (instructions, data)

• Each processor executes individual instructions
• Synchronization is required
• High degree of flexibility
• Various structures are possible

Interconnection networks

Classification of MIMD machines

• UMA (Uniform Memory Access Model) provides shared memory that can be accessed from all processors in the same manner.

• NUMA (Non-Uniform Memory Access Model) provides shared memory, but it is not uniformly accessed.

• NORA/NORMA (No Remote Memory Access Model) provides no shared memory. Communication is done by message passing.

An example of UMA: bus-connected

Snoop Cache

Main Memory

shared bus

Snoop Cache

SMP(Symmetric MultiProcessor)

On chip multiprocessor

Switch connected UMA

Switch

CPU Interface

Local Memory

Main Memory


• Each processor provides a local memory, and accesses other processors’ memory through the network.

• Address translation and cache control often make the hardware structure complicated.

• Scalable:
– Programs for UMA can run without modification.
– The performance is improved as the system

Competitive to WS/PC clusters with Software DSM

Typical structure of NUMA

Node 0

Node 1

Node 2

Node 3

Interconnection Network

Logical address space

Classification of NUMA

• Simple NUMA:
– Remote memory is not cached.
– Simple structure, but the access cost of remote memory is large.
• CC-NUMA: Cache Coherent
– Cache consistency is maintained with hardware.
– The structure tends to be complicated.
• COMA: Cache Only Memory Architecture
– No home memory
– Complicated control mechanism

Cray’s T3D: A simple NUMA supercomputer (1993)

• Using Alpha 21064

The Earth simulator(2002)

NORA/NORMA

• No shared memory
• Communication is done with message passing
• Simple structure but high peak performance

The fastest processor is always NORA (except The Earth Simulator)

Hard for programming

Inter-PU communications Cluster computing

Early Hypercube machine nCUBE2

MP-SoC Examples and Applications

Dual-Core (DSP+ARM) Platform

IBM Power4

– 2 cores
– F = 1.4 GHz
– Single clock over entire die
– Balanced H-tree driving global grid
– Measured clock skew below 25 ps
– Power ~85 W
– 180 nm SOI process, 174M transistors

IBM’s Multiple processors on MCM

– 4 POWER4 chips in a single module (MCM)
– The POWER4 chips are connected via four 128-bit buses
– Up to 128MB L3 cache
– Bus speed ½ processor speed
– Total throughput ~35

MPSoC “Bus” Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
– Point-to-point communication
– Signals between cores transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
– Programmable interconnects
– Employ static network
• Arbitrated Bus [IDT Inc., 2000]
– Time-shared multiple core connectivity
– Use arbitrator
• Hierarchical Bus [AMBA, ARM Inc]
– Combine multiple buses using bus
– Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
– Resources communicate with data packets
– Use switch fabric
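As an illustration of the arbitrated-bus alternative, the sketch below time-shares one bus among several cores. The slide only says that an arbitrator is used; the round-robin policy and the class name `RoundRobinArbiter` are assumptions made for this example.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Grants a shared bus to one requesting master per cycle, in rotating order."""

    def __init__(self, num_masters: int) -> None:
        self.num_masters = num_masters
        # Start so that master 0 has the highest priority on the first grant.
        self.last_granted = num_masters - 1

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        """Return the index of the master granted this cycle, or None if nobody requests."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    for cycle in range(4):
        # Masters 0 and 1 request the bus every cycle; master 2 stays idle.
        print(cycle, arbiter.grant([True, True, False]))
```

The rotation keeps any single core from monopolizing the bus, but every transfer still serializes on the one shared medium, which is the scalability limit the NoC alternative addresses.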

Available Mobile Processors

• The ARM Family
– The ARM7 Generation
– The StrongARM
– The ARM Thumb Option
– The ARM Piccolo Option
– The ARM9 and ARM10
• The Motorola M-Core
• The LSI TinyRisc
• The Hitachi SuperH Family
• VLIW Processors
– The Motorola-Lucent Star*Core
– The Philips TriMedia
– The HP/Intel IA-64

Available MP-Cores

• TI OMAP
• Philips’s Nexperia™ DVP
• ST Nomadik
• Intel® Itanium® Montecito
• CELL Processor
• CT 3400 Multi-core DSP
• HiBRID-SoC
• Systolic Ring
• Virtual platform in SHAPES project

TI OMAP

• Targets communications, multimedia.

• Dual-processor (DSP, RISC) with shared memory

• Hierarchical Definition of Platform

• Critical Role of Software as well as Hardware

• OCP (Open Core Protocol) based SoC platform

C55x DSP

OMAP 5910:

Memory ctrl

MPU interface

System DMA control

bridge

Platform Layers and Classification

• Level 0: Foundation Platform
– Infrastructure & standards: basic architecture
– Processor core, peripheral/interface IP, bus: e.g., ARM PrimeXsys
• Level 1: Application-Specific Integration Platform
– Application-specific SoC: HW & SW
– Mobile platform, home platform
• Level 2: System Platform
– Terminal platform
– Handset case: RF + Modem + AP + Memory + MMI

Hierarchy of Platforms in OMAP

Reference Design

Application Platform

SoC Platform

ASIC Library & Tools

Silicon Technology

Application Specific

Broadly Applica

OMAP Products

OMAP Infrastructure

System Platform

Scalable Multi-processors

TI OMAP 1510 Platform Architecture

Peripheral Bus

SDRAM Bus(16)

Peripherals: LCD Controller, Interrupt Handlers, Timers, GPIO, UARTs, McBSP

Peripheral Bus

SystemDMA

Peripherals Buses (8/16/32)

MAIL Box

DSP MMU

Traffic Controller

EMIFF EMIFS

SRAMLB MMU

HASB MMU

Flash Bus(16)

LB(32)

HASB(32)

GPP: TI925 Core
- 16KB I-Cache
- Write Buffer
- MMU and D-MMU
- Dual TLB

DSP: C55x Core
Internal Memory
- 48KW SARAM
- 32KW DARAM
- 16KW PDRAM
24KB I-Cache
Graphic HW Accelerator
ARM Port Interface

IPC
- Mail Boxes
- API
- DSP MMU

System DMA
Traffic Controller
Internal SRAM
Busses
Peripherals

TI OMAP 1510 Platform S/W

TI925 General-Purpose Processor

OS kernel & drivers

TMS320 DSP

OS adapter, LINK driver, MCU Bridge Kernel

RESOURCE MANAGER

LINK driver, other drivers

DSP/BIOS Kernel, RM Server

MP3, AMR, MEDIA APIs

Raw data streams: video, audio, speech

XDAIS algorithms encapsulated in socket nodes

Node Data Base

Philips’s Nexperia™ DVP

(source: Th. Claasen, Philips, DAC 2000)

Philips Nexperia™ DVP S/W Reference Architecture

Analog Inputs

Analog Front End

Digital Inputs

Analog Front End

Optical Drive

Network Protocols

Hard Disk

Digital Front Ends

Network Protocols

Players

Broadcast-MPEG2

VCD/SVCD

CD/SACD

Broadcast-MPEG4

Recorders

DVD+RW Auth

PVR-SPTS

Lo-Rate SPTS

CD/DVD-MP3

• • •

Transcoders

Translators

TS-SPTS Filter

Loopback / Feedthrough

Digital Outputs

Protocol Stack

Network Protocols

Driver

HDD/Ethernet

Presentation Engine

Audio and Video Processing

Philips Nexperia™ DVP MP-SoC

• Philips's advanced set-top box and digital TV SoC (Viper2)
• 0.13 μm
• 50 M transistors
• 100 clock domains
• > 60 IP blocks

ST Nomadik

• Targets mobile multimedia. A heterogeneous multiprocessor-of-multiprocessors.

Power Distribution: Intel Xeon Processor

Clock and Power Convergence

Dynamic voltage and frequency scaling (DVS)

Intel® Itanium® Montecito - Clock system architecture

– Each core split into 3 clock domains on variable power supply

Intel® Itanium® Montecito - Power management

– Dynamic voltage-scaling power management system
– 4 on-die sensors
– On-die microcontroller
– Power and temperature measurement
– Voltage and frequency modulation
– 8 μs power/temperature sampling interval
– Embedded firmware
– Power, temperature, or calibration measurements
– Power: closed-loop power control and system stability check
– Temperature: thermal sensor readout (junction temperature below 90°C monitoring) and power-control communication
– Calibration: power-measurement accuracy check

The implementation of a first-generation CELL Processor

The Cell Processor

• Fclock > 4 GHz.
• Memory bandwidth: 25.6 GBytes per second.
• I/O bandwidth: 76.8 GBytes per second.
• Performance:
– 256 GFLOPS (single precision at 4 GHz).
– 256 GOPS (integer at 4 GHz).
– 25 GFLOPS (double precision at 4 GHz).
• 235 square mm.
• 235 million transistors.
• Power consumption estimated at 60 - 80 W @ 4 GHz
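The single-precision peak quoted above can be reproduced with simple arithmetic. The breakdown used here (8 synergistic processing elements, 4 single-precision SIMD lanes each, 2 floating-point operations per lane per cycle through fused multiply-add) is not stated on the slide and is an assumption of this sketch, as are the function names.

```python
def peak_gflops(units: int, simd_lanes: int, flops_per_lane_per_cycle: int,
                clock_ghz: float) -> float:
    """Peak throughput in GFLOPS = units * lanes * flops per lane per cycle * GHz."""
    return units * simd_lanes * flops_per_lane_per_cycle * clock_ghz


def bytes_per_flop(memory_bandwidth_gbytes_s: float, gflops: float) -> float:
    """Memory bandwidth available per floating-point operation at peak rate."""
    return memory_bandwidth_gbytes_s / gflops


if __name__ == "__main__":
    # Assumed breakdown: 8 SPEs x 4 lanes x 2 flops (multiply-add) at 4 GHz.
    peak = peak_gflops(units=8, simd_lanes=4, flops_per_lane_per_cycle=2, clock_ghz=4.0)
    print(f"peak single precision: {peak:.0f} GFLOPS")
    # 25.6 GBytes/s is the memory bandwidth quoted on the slide.
    print(f"memory bytes per flop at peak: {bytes_per_flop(25.6, peak):.2f}")
```

At roughly 0.1 byte of memory bandwidth per flop, sustained performance near the peak depends on keeping data in on-chip local storage.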

Cell’s Element Interconnect Bus

• 4 rings (2 clockwise + 2 counter-clockwise)
• No token rings, still request/grant arbitrations

CT 3400 Multi-core DSP

• Eight 32-bit DSP cores

• Six 32-bit general-purpose processor cores

• 128-pin programmable I/O subsystem

• Programmable in C

• Supports H.264 and MPEG-4 codecs

http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

H.264 encoder , decoder and audio codecs and the system control

H.264 codec onto CT3400 MDSP

From cradle

CT3400 DSP Engine

http://www.cradle.com/downloads/Efficient_H.264_Mapping.pdf
http://www.cradle.com/downloads/CT3400_Datasheet_DS0209.pdf

DSP Engine

Each DSP engine contains:

• A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
• A Packed Integer Multiplier Accumulator (PIMAC)
• A Floating Point Unit (FPU)
• Bi-directional FIFO data buffers
• DMA channels
• A 128 x 32 register and a 512 x 20 program memory

CT3600 Multiprocessor DSP Family Members

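• The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as the nearest competing product

• It is the industry's first single-chip, real-time D1 H.264 Main Profile video encoder based on programmable DSPs

• 0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple overall performance

• Priced from $40 to $90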

http://www.cradle.com/downloads/CT3600-PB.pdf

HiBRID-SoC Architecture

Multi-Core SoC Architecture

Dedicated chips for the MPEG-4 Simple Profile

Integrates a powerful on-chip communication structure

Three programmable cores, each adapted towards a specific class of algorithms:
– Instruction level: VLIW (very long instruction word)
– Data level: SIMD (single instruction, multiple data)
– Task level (simultaneous multithreading)

Developed at the University of Hannover

Multi-Core SoC Architecture

• HiPAR-DSP
– 16-datapath SIMD processor core controlled by VLIW
– Particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
• Stream Processor (SP)
– 32-bit RISC architecture optimized towards control-dominated tasks
– Bitstream processing or global system control
• Macroblock Processor (MP)
– Efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit)
– Controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps

HiBRID-SoC multi-core architecture

64-bit AMBA AHB system

Connects all cores

SDRAM memory via a 64-bit SDRAM interface

Two versatile 32-bit host interfaces for access (e.g., host PC via PCI and to serial flash memory)

HiPAR-DSP

Highly parallel DSP core with a VLIW-controlled SIMD architecture

DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory

At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3

Macroblock processor

Heterogeneous data path structure consisting of a scalar and a vector data path

The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)

The vector data path is equipped with a 64-entry register file of 64-bit width

Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms.

MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
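Subword parallelism of the kind described above can be mimicked in software: one 64-bit operand is treated as two, four, or eight independent lanes, and carries are not allowed to cross lane boundaries. The sketch below illustrates the idea only; it does not model the actual HiBRID-SoC instruction set, and `packed_add` is a hypothetical name.

```python
def packed_add(a: int, b: int, lane_bits: int, reg_bits: int = 64) -> int:
    """Add two registers lane by lane, wrapping around inside each lane.

    lane_bits selects the subword size (for example 8, 16 or 32 bits in a
    64-bit register), so one call performs reg_bits // lane_bits additions.
    """
    if reg_bits % lane_bits != 0:
        raise ValueError("lane size must divide the register width")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, reg_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of the lane
    return result


if __name__ == "__main__":
    # Four 16-bit additions in one 64-bit operand.
    print(hex(packed_add(0x0001_0002_0003_0004, 0x0010_0020_0030_0040, 16)))
```

The same operands give different results for different lane widths, because the lane size decides where carries stop.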

HiBRID-SoC Implementations

Chip layout of the HiBRID-SoC.

MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, 720×576 @ 25 Hz, 1.5-3 Mbit/s:

HiBRID-SoC is fabricated in a 0.18 μm, 6LM standard-cell technology:

14 million transistors, 3.5 W

82 mm², 145 MHz

New Taxonomy/Metric

• Flynn: Triple (d, i, c)
– d: # of data streams
– i: # of instruction streams
– c: # of configuration states
– SISD, SIMD, MIMD, MISD

• RA: (c, g, a)
– c: configurability to various environments
– g: size of granularity
– a: adaptability to various components
– SCSG, SCMG, SCLG
– MCSG, MCMG, MCLG

Systolic Ring

• Based on a coarse-grained configurable PE

• Circular datapaths
– C: # of layers (C = 4)
– N: # of Dnodes per layer (N = 2)
– S: # of rings (S = 1)

• Control units (sequencer)
– Local Dnode unit
– Local Ring unit
– Global unit

[Figure: Systolic Ring with four layers (layer 1 to layer 4), each holding two Dnodes and a switch, with a Dnode sequencer and a local ring sequencer]

Motivation For Using Hierarchical Rings

• Relatively simple switching logic reduces the complexity at each node resulting in reduced buffer, area and energy requirements.

• Low latency since packets are forwarded in 1 clock cycle.

• Packets will always arrive in-order at the destination.

• Broadcast and Multicast packets are efficiently implemented.

• Hierarchical rings can be partitioned into independent clock domains.

Remanence

$R = \dfrac{N_{PE} \cdot F_e}{N_c \cdot F_c}$

• $N_{PE}$: # of processing elements (PE)
• $N_c$: # of PEs configurable per cycle
• $F_e$: operating frequency
• $F_c$: configuration frequency

Characterizes the dynamism:
• # of cycles to (re)configure the whole architecture
• Amount of data to compute between 2 configurations
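The remanence metric can be evaluated directly from these four quantities. The sketch below assumes the formula reads R = (NPE · Fe) / (Nc · Fc) and that the operating and configuration frequencies are equal unless stated otherwise; the sample values are the NPE and Nc entries of the comparison table given later in these slides.

```python
def remanence(n_pe: float, n_c: float, f_e: float = 1.0, f_c: float = 1.0) -> float:
    """Remanence R = (N_PE * F_e) / (N_c * F_c).

    With F_e == F_c this is the number of cycles needed to (re)configure
    the whole architecture.
    """
    return (n_pe * f_e) / (n_c * f_c)


if __name__ == "__main__":
    # (name, N_PE, N_c) taken from the "Comparisons of RA" table.
    for name, n_pe, n_c in [
        ("ARDOISE", 2304, 0.14),
        ("Systolic Ring", 24, 4),
        ("MorphoSys", 128, 16),
        ("TMS320C62", 8, 8),
    ]:
        print(f"{name}: R = {remanence(n_pe, n_c):.0f}")
```

A low remanence (the DSP reconfigures every cycle) means a highly dynamic architecture; a fine-grain fabric needs thousands of cycles before its whole configuration has changed.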

[Figure: reconfigurable architecture model with interconnection and routing resources, processing elements (PE), a configuration memory, and a sequencing unit whose sequencer issues inst0, inst1, inst2, inst3, …]

Operative Density

$N_{PE}$: # of PEs

$A$: core area (relative unit λ²)

Area can be expressed as a function of $N_{PE}$:

$OD(N_{PE}) = \dfrac{N_{PE}}{A(N_{PE})}$

Remanence formalisation

• # of layers: C = 8
• # of Dnodes per layer: N = 2
• 1 Systolic Ring: S = 1

[Figure: remanence versus # of Dnodes (0 to 180) for k = 1, k = 2, and k = 4, next to an eight-layer Systolic Ring (layer 1 to layer 8, two Dnodes and a switch per layer)]

$R(N_{PE}) = k \cdot N_{PE}$, with $k = C/N$

Architectural model Characterization

[Figure: four Systolic Rings of Dnodes and switches attached to global buses, with a global sequencer and local ring sequencers]

• # of layers: 4 (C = 4)
• # of Dnodes per layer: 2 (N = 2)
• 4 Systolic Rings (S = 4)

Control units
• Local Dnode unit
• Local Ring unit
• Global unit

•www.qstech.com

[Figure: design space, remanence versus # of Dnodes (0 to 140)]

• Best OD and remanence: worst interconnect resources and processing power
• Worst OD and remanence: best interconnect resources and processing power

Comparisons of RA

1. Only 1 cycle to (re)configure the DSP
2. Few cycles to (re)configure coarse grain RA (≤8)
3. Many cycles to (re)configure fine grain RA

Name            Type              NPE     Nc      R        F (MHz)
ARDOISE         Fine Grain RA     2304    0.14    16457
Systolic Ring   Coarse Grain RA   24      4       6
MorphoSys       Coarse Grain RA   128     16      8
TMS320C62       DSP VLIW          8       8

$R = \dfrac{N_{PE} \cdot F_e}{N_c \cdot F_c}$

Pascal BENOIT

Virtual platform in SHAPES project

Homogeneity and Heterogeneity

MPSoC Architecture Trends

[Figure: exploitable task parallelism versus minimum parallel grain size (in instructions). Regions shown: Instruction-Level Parallelism; MultiFlex Thread-Level Parallelism (100’s of instructions); GP O/S Thread-Level Parallelism (10 000’s of instructions). Axis values shown: 1~8, 2~6.]

Three Levels of Parallelism

Parallel Heterogeneous Platforms (PHPs)

• Challenges:– Explore the theoretically high performance

Platform        Company             PEs     Het?
Cell            IBM/Sony/Toshiba    9       Y
DRP             NEC                 512     N
Nomadik         ST                  3+      Y
OMAP2420        TI                  4       Y
Nexperia        Philips             3+      Y
X-Fi            Creative            7       Y
ARM11 MPCore    ARM                 1-4     N
IXP2800         Intel               17      Y
MXP5800         Intel               54      Y
…               …                   …       …

(From Abhijit Davare’s Quals Presentation)

Homogeneous MP-SOC

• 32-bit ARM processors
• Private memory
• Shared memory
• Hardware interrupt module
• Hardware semaphore module
• 32-bit interconnection (AMBA Bus or STBus)
• Processor core modeling: C++
• Hardware interconnection modeling: SystemC

NEC MP211: Homogeneous MP core

• Asymmetric mp with very coarse grain multitasking

• 3 ARM9’s utilized as predefined function units

• NO complex overhead : e.g. no cache coherency, dynamic scheduling/load balancing


MP211 block diagram

Power consumption of H.264+AAC

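H.264 video decoder (QVGA, 15 fps) and MPEG2 AAC decoder (48K stereo, 128 kbps)

DTV: 87 mW (excluding I/O and SDRAM), 124 mW (including I/O and SDRAM)

The L0 region represents the baseline power consumption, and the L1 region represents the periods in which IP blocks with high power consumption are running.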

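Problems with Homogeneous MP

▷ Under a power constraint, a monolithic processor consumes a large amount of power.
▷ Using several identical (homogeneous) processors gives low resource utilization, so power consumption grows linearly.
▷ The interconnect is not only wire-dependent but also logic-dependent.
▷ Processors are bound by wire and memory latencies.
▷ Peak performance is reached only for specific application domains.
▷ The on-chip interconnect has been designed independently, separately from the cores and caches.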

Heterogeneous MP Core

If two or more cores share L2, the way a lot of present CMPs do, a crossbar provides a high-bandwidth connection.

A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time.

Heterogeneous MP core

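• Easy hardware implementation: using processor cores that are already in wide use shortens hardware development time and lowers cost.

• Lower power consumption: the distributed work is handled by multiple processors running at a lower clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

• Boosting real-time performance: each application can run on a different processor, which reduces interfacing between multiple applications.

• Higher system security: system software and untrusted applications can be separated by running them on different processors.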

Heterogeneous MP Core

▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want to use small cores as well as large cores.

▷ A multi-ISA multicore architecture is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. The number, size, and type of cores, as well as the caches, must be decided. For an 8-core processor, the interconnect consumes as much power as one core.

▷ For a dual processor, for example, heterogeneous processors that exploit low and high thread-level parallelism improve performance by 63% over homogeneous ones. With 5-8 threads, the average improvement is 29%.

▷ Serial sections run quickly on a large core, while parallel sections run on small, low-power cores, which maximizes the performance-to-power ratio. [Annavaram, et al]

NEC’s Asymmetric(or Heterogeneous) Multi processing

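• Using processor cores that are already in wide use shortens hardware development time and lowers cost.

• Lower power consumption: the distributed work is handled by multiple processors running at a lower clock frequency. A lower clock frequency allows a lower supply voltage, which reduces power consumption.

• Scalable: performance and cost can be adjusted by increasing or decreasing the number of processor cores.

• Boosting real-time performance: each application can run on a different processor, which reduces interfacing between applications.

• Higher system security: system software and untrusted applications can be separated by running them on different processors.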

Problems with Heterogeneous MP-SoC

• Processors are bound by wire and memory latencies
• Peak performance on only a small class of applications.
• How well they map to a given design
• Diversification of workloads
• Increased hardware complexity
• Poor resource utilization

AMP task allocation image

Bus and Memory Architecture

Alpha cores scaled to 0.10 μm.

EV8 is 80 times bigger but provides only two to three times more single-threaded performance

Equal-area heterogeneous architectures with multithreaded cores

Exploring the potential from heterogeneity

MP-SoC Design Automation

Optimization and Synthesis

• Computation Synthesis:
– Task Allocation
– Task Scheduling

• Communication Synthesis:
– Interconnection Synthesis
– Buffer sizing

Energy-Aware Task mapping

Minimize energy consumption, given a CTG and a heterogeneous NoC
• Find:
– A mapping function M: tasks (T) => PEs (P)
– Assuming the tasks are already scheduled and partitioned
• Solution formulated as a quadratic assignment problem and solved using branch and bound.
• Communication-optimal task mapping
– Minimal hardware (buffers and wires) required to meet the timing requirements defined in the specification.
– Given a multiprocessor network, find a mapping of the application that satisfies the timing constraints.
• Genetic algorithm (chromosome, generation, crossover, mutation)
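The mapping problem can be made concrete on a tiny example. The sketch below is not the branch-and-bound or genetic formulation mentioned above: it substitutes an exhaustive search over all placements, assumes a homogeneous 2D mesh, and uses an assumed cost model in which energy is proportional to communication volume times Manhattan hop count. All names and numbers are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Tile = Tuple[int, int]
Edge = Tuple[str, str]


def manhattan(a: Tile, b: Tile) -> int:
    """Hop count between two tiles of a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def comm_energy(mapping: Dict[str, int], edges: Dict[Edge, float],
                tiles: List[Tile], energy_per_unit_hop: float = 1.0) -> float:
    """Total communication energy: sum over CTG edges of volume * hops * unit energy."""
    return sum(
        volume * manhattan(tiles[mapping[src]], tiles[mapping[dst]]) * energy_per_unit_hop
        for (src, dst), volume in edges.items()
    )


def best_mapping(tasks: List[str], edges: Dict[Edge, float],
                 tiles: List[Tile]) -> Tuple[Dict[str, int], float]:
    """Try every one-to-one placement of tasks on tiles and keep the cheapest."""
    best, best_cost = {}, float("inf")
    for perm in permutations(range(len(tiles)), len(tasks)):
        mapping = dict(zip(tasks, perm))
        cost = comm_energy(mapping, edges, tiles)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost


if __name__ == "__main__":
    tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 mesh
    tasks = ["t0", "t1", "t2", "t3"]
    edges = {("t0", "t1"): 100, ("t1", "t2"): 10, ("t2", "t3"): 100, ("t0", "t3"): 1}
    print(best_mapping(tasks, edges, tiles))
```

Exhaustive search grows factorially with the number of tiles, which is why the slide turns to branch and bound or genetic algorithms for realistic sizes.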

Addressed by Hu et al 2002:

Interconnection Synthesis
– With each new technology:
– Gate delay decreases ~25%
– Wire delay increases
– Cross-chip communication increases
– Clock needs multiple cycles to cover die

Source: SIA NTRS Projection

Interconnect Delays & Density

Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology

Buffer Sizing

• Architectures have bounded buffer resources.
• If more communication buffer resources are utilized, processors may spend less time waiting to send/receive data.
• Additional buffer resources may adversely affect communication overhead, achievable clock speed, or design closure.

Multiple Clocks due to Interconnect limitation

MPSoC HW platform perspective

• Today's platforms are quite heterogeneous
– Reasons: efficiency and legacy IP
• Homogeneous MPSoC would scale well and would simplify programming
– Works well for desktop PCs
– Too inefficient for embedded apps
• Mixed MPSoC as a compromise?
– Globally homogeneous, locally heterogeneous
– (Re)configurable PEs

www.iss.rwth-aachen.de

Future MPSoC programming

• Sequential-to-parallel code generation: C code (and platform/RTOS model) in, parallel C codes out
» Step 1: exhibit parallelism at block/task level to the user for manual mapping
» Step 2: automate code partitioning/mapping
• Massive use of compiler technology, e.g. data flow analysis
• Use of “platform refinement” technology as backend for machine code generation and simulation

www.iss.rwth-aachen.de

The von Neumann inheritance

• Sequential programming of sequential machines
– Pascal, Modula-2, C, C++, Java, ...
• Sequential programming of parallel machines?
– VLIW: handled by sophisticated compilers
– SIMD: will be accomplished by compilers
– Does not scale to heterogeneous MPSoC with distributed control paths!
• Parallel programming of parallel machines!
– “We need to move ... to parallel thinking and programming... We are standing at the very beginning... It's a huge area.” (J. Gutknecht, ETH Zurich)
• What to do in the meantime?

Block clustering approach

Block clustering approach (cont'd)

Multicore SoC Design Methodology

Network On Chip

Technology Evolution

What are NoC’s?

• According to Wikipedia:

– “Network-on-a-chip (NoC) is a new paradigm for System-on-Chip (SoC) design. NoC based-systems accommodate multiple asynchronous clocking that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communications and claims roughly a threefold performance increase over conventional bus systems.”

Network-on-Chip (NoC)

• Communication is achieved by connecting switches together to form a network topology:

• Offers much greater scalability.
• Parallelism: multiple components can send data simultaneously
• Energy efficient: point-to-point connections require less energy than a bus.
• Global synchronization is no longer needed.

NoC Design Considerations (I)

• There are several popular topologies:
– 2D Mesh (most popular).
– Torus (rings)
– Tree (fat-tree, butterfly fat-tree)

• The on-chip interconnection network will soon be a limiting factor for performance and energy consumption:
– has been reported to account for over 50% of the total energy requirement!

• The interconnect should consume the fewest resources possible and should be:
– area efficient: switches should be as small (simple) as possible.
– energy efficient: related to area efficiency
– fast: simple routing algorithms should be used.
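To illustrate the "simple routing algorithms" point on the most popular topology, here is a minimal sketch of dimension-ordered (XY) routing on a 2D mesh. XY routing is a common choice but is not named in the slides; the `(x, y)` coordinate scheme and the function name `xy_route` are assumptions made for this example.

```python
def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst on a 2D mesh.

    Dimension-ordered routing: travel along X until the column matches,
    then along Y. Each hop needs only a coordinate comparison, which keeps
    the switch small.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because every packet resolves X before Y, the decision logic is identical in every router, which fits the requirement that switches stay simple.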

NoC exemplified

[Figure: processors (masters), global memory, and global I/O (slave) attached to routing nodes]

NoC: Good news

☺ Only point-to-point one-way wires are used, for all network sizes.

☺ Aggregated bandwidth scales with the network size.

☺ Routing decisions are distributed and the same router is re-instantiated, for all network sizes.

☺ NoCs increase wire utilization (as opposed to ad-hoc p2p wires)

Sergio Tota and Mario R. Casu

There’s no free lunch…

• Internal network contention causes (often unpredictable) latency.
• The network has a significant silicon area.
• Bus-oriented IPs need smart wrappers.
• Software needs clean synchronization in multiprocessor systems.
• System designers need reeducation for new concepts.

Sergio Tota and Mario R. Casu

Facts about NoC’s

• It is a way to decouple computation from communication

• The design is layered (physical, network, application…): Taming complexity is made easier

• Communication between processing elements in NoC takes place by encapsulating data in packets

• The elementary packet piece to which switch and routing operations apply is the flit
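The packet and flit relationship can be sketched as follows. The head/body/tail flit kinds and the names `Flit`, `packetize`, and `depacketize` are assumptions for illustration; the slides only state that data is encapsulated in packets and that the flit is the unit switches operate on.

```python
from dataclasses import dataclass


@dataclass
class Flit:
    kind: str   # "head", "body", or "tail" (assumed framing convention)
    data: int   # destination id for the head flit, payload word otherwise


def packetize(dest, words):
    """Wrap payload words in one packet: a head flit, body flits, a tail flit."""
    if not words:
        raise ValueError("a packet needs at least one payload word")
    flits = [Flit("head", dest)]
    flits += [Flit("body", w) for w in words[:-1]]
    flits.append(Flit("tail", words[-1]))
    return flits


def depacketize(flits):
    """Recover (dest, words) from a well-formed packet."""
    return flits[0].data, [f.data for f in flits[1:]]


if __name__ == "__main__":
    for flit in packetize(3, [10, 20, 30]):
        print(flit)
```

Only the head flit carries routing information, so the processing element never needs to know how the network delivers the data.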

Topologies

• Heritage of networks with new constraints
– Need to accommodate interconnects in a 2D layout
– Cannot route long wires (clock frequency bound)

a) SPIN, b) CLICHÉ, c) Torus, d) Folded torus, e) Octagon, f) BFT.

SPIN (Guerrier et al., DATE ’00/’03)

• Wormhole switching, adaptive routing and credit-based flow control.
• It is based on a fat-tree topology.
• A flit is only one word (36 bits, 4 bits are for packet framing).
• The input buffers have a depth of 4 words
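Credit-based flow control with a 4-word input buffer can be sketched as below. This is a simplified model, not the SPIN router itself: credits are returned instantly, and the class name `CreditLink` is an assumption.

```python
from collections import deque


class CreditLink:
    """Sender-side credit counter paired with a receiver input buffer."""

    def __init__(self, depth=4):  # depth 4 matches the input buffers above
        self.credits = depth
        self.buffer = deque()

    def send(self, flit):
        """Send one flit if a credit is available; return True on success."""
        if self.credits == 0:
            return False          # no free slot downstream: the sender stalls
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def drain(self):
        """Receiver forwards one flit and hands a credit back to the sender."""
        flit = self.buffer.popleft()
        self.credits += 1
        return flit


if __name__ == "__main__":
    link = CreditLink()
    print([link.send(i) for i in range(5)])
```

The sender never overruns the buffer because it only transmits while it holds a credit, so no flit is dropped inside the network.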

Dally et al., DAC'01

• 2D folded torus topology
• Wormhole routing and Virtual Channels (VC)

Kumar et al., ISLVLSI’02

• Chip-Level Integration of Communicating Heterogeneous Elements (CLICHÉ)
• 2D Mesh Topology
• Message Passing

Pande et al., TCOMP'05

• Butterfly Fat Tree
• Wormhole, Virtual channels
• Header flits: 3 ck cycles latency (input arbitration, routing, output arbitration)
• "Body" flits: 3 ck cycles (input arbitration, switch traversal, output arbitration)

Goossens et al., IEE CDT’03

• Both VCT (virtual cut-through) and WH (wormhole), GT (guaranteed throughput) and BE (best effort), IQ (input queuing) and VOQ (virtual output queuing)

• GT uses TDM to avoid contention and create virtual circuits. In each time slot a block of 3 flits is transferred from input "j" to output "k" in a store-and-forward (S&F) fashion.

• BE uses Matrix Scheduling
• GT connections are set up by special BE system packets
• Prototype with WH and IQ
– 5 ports
– 0.13 um, 0.26 mm2, 500/166 MHz
– Flit size = 3 words, each 32 bits
– 80 Gb/s aggregate bandwidth
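The 80 Gb/s figure is consistent with the other prototype numbers if each of the 5 ports moves one 32-bit word per cycle at 500 MHz. That reading is an assumption, not something the slide states; a quick check:

```python
PORTS = 5
WORD_BITS = 32
CLOCK_HZ = 500e6  # assumed: the faster of the two quoted clock rates


def aggregate_bandwidth_gbps(ports, word_bits, clock_hz):
    """Aggregate bandwidth in Gb/s, assuming one word per port per cycle."""
    return ports * word_bits * clock_hz / 1e9


if __name__ == "__main__":
    print(aggregate_bandwidth_gbps(PORTS, WORD_BITS, CLOCK_HZ))
```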

SKKU’s Mobile MP-SoC Platform

Koonshik Cho & Jun Dong Cho

Mobile SoC Design Automation Lab.

Sungkyunkwan Univ.

1. Multiprocessor SoC Design Platform

• SW: respond proactively to performance improvements and changes in standards

• HW: Modular, Flexible and Scalable Architecture

• Platform based design

2. Multiprocessor SoC Platform test

• DVB-T Receiver

3. Tools

• Seamless CVE (Mentor Graphics)

• ADS(ARM)

SoC (DSP+ARM) Platform

Extended multi-processor platform

ARM Platform

AMBA BUS (1)

The AMBA bus contains a multiplexer, an arbiter, and a decoder, which together mediate between multiple masters and slaves.

AMBA BUS (2)

AMBA Bus (Master to slave multiplexer)

• A bus master is a device that initiates operations such as reads and writes by driving address and control signals to a slave. Only one master may transfer at any given time. Multiple masters are supported.

AMBA BUS (3)

AMBA Bus (Slave to master multiplexer)

• A bus slave is a device that services master reads and writes within its assigned address space. The slave reports its operating status to the master through ready and response signals. Multiple slaves are supported.

AMBA BUS (4)

• AHB Arbiter: The bus arbiter ensures that only one master is selected at a time. It arbitrates according to its own priority scheme, and an AHB has exactly one arbiter.

• AHB Decoder: The AHB decoder selects the appropriate slave using the upper bits of the address driven by the master. An AHB likewise has a single decoder.

• APB Bridge: The only bus master on the APB (Advanced Peripheral Bus). The APB bridge is a slave on the ASB; when the decoder selects the APB, the bridge acts as the master on the APB. As a slave module, the APB bridge handles bus handshaking and signal retiming on behalf of the local peripheral bus.
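The decoder and arbiter roles can be sketched as follows. The address map, the choice of the top 4 address bits, and the fixed-priority policy are all hypothetical; the slides only say that the decoder uses the upper address bits and that the arbiter has its own priority scheme.

```python
# Hypothetical address map: the top 4 bits of a 32-bit address select a slave.
SLAVE_MAP = {
    0x0: "internal SRAM",
    0x1: "external memory",
    0x8: "APB bridge",
}


def decode(addr):
    """Return the slave selected by the upper 4 bits of a 32-bit address."""
    return SLAVE_MAP.get((addr >> 28) & 0xF, "default slave")


def arbitrate(requests):
    """Fixed-priority arbiter (assumed policy): lowest-numbered master wins.

    `requests` is a list of booleans indexed by master number. Returns the
    granted master index, or None when no master is requesting.
    """
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


if __name__ == "__main__":
    print(decode(0x80001000), arbitrate([False, True, True]))
```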

AMBA BUS (5)

• Interrupt controller: Receives interrupt request signals from up to 32 interrupt sources and generates the nIRQ or nFIQ signal applied to the ARM9 processor. Of the 32 sources, sources 0 to 3 generate nFIQ and sources 4 to 31 generate nIRQ. Lower-numbered sources have higher priority.

• Timer: The timer module provides three timers. Each timer is a 16-bit counter that supports three prescale values (16, 256, and 4096), decrements its count by 1 every period, and raises an interrupt when the count reaches 0. The interrupt request is held until the ARM9 processor acknowledges it through the timer interrupt clear register.
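A behavioral sketch of the two peripherals described above. The function and class names are assumptions, the signals are modeled as plain booleans rather than active-low wires, and the timer is modeled as one-shot because the slides do not say whether it reloads.

```python
def resolve_interrupts(pending):
    """Map a 32-bit pending mask to (fiq_active, irq_active, winner).

    Sources 0-3 drive nFIQ, sources 4-31 drive nIRQ, and the lowest-numbered
    pending source has the highest priority.
    """
    fiq = bool(pending & 0x0000000F)
    irq = bool(pending & 0xFFFFFFF0)
    winner = None
    for source in range(32):
        if pending & (1 << source):
            winner = source
            break
    return fiq, irq, winner


class Timer:
    """16-bit down counter with a prescaler of 16, 256, or 4096 clocks."""

    def __init__(self, load, prescale=16):
        if prescale not in (16, 256, 4096):
            raise ValueError("unsupported prescale")
        self.count = load & 0xFFFF
        self.prescale = prescale
        self.ticks = 0
        self.irq = False

    def clock(self):
        """Advance one input clock; decrement once per prescale period."""
        self.ticks += 1
        if self.ticks == self.prescale:
            self.ticks = 0
            if self.count > 0:
                self.count -= 1
                if self.count == 0:
                    self.irq = True   # held until clear_irq() is called

    def clear_irq(self):
        """Models a write to the timer interrupt clear register."""
        self.irq = False
```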

Teak DSP Platform

• Structure of the Teak DSP platform, which serves as the co-processor in the overall platform

Configuration of crossbar switch

• Communication interface architecture (crossbar structure)

Reconfigurable crossbar switch

Uses the VHDL generate statement

Reconfigurable crossbar switch (VHDL code)

entity CI_TOP is
  generic ( number_of_masters, number_of_slaves : integer );
  port ( … omitted );
end CI_TOP;

Entity of the CI module (ci_top.vhd)

COMMUNICATION_INTERFACE : CI_TOP
  generic map ( number_of_masters => 4, number_of_slaves => 6 )
  port map ( … omitted );

Use of the CI module (multiplatform.vhd)

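Communication Interface: Mux vs Crossbar

Structure | Advantage | Disadvantage
Mux | ‣ Relatively easy to implement ‣ Effective when there are few masters and slaves | ‣ Parallel processing between processors is difficult ‣ Becomes a bottleneck when the system is extended
Crossbar | ‣ Enables effective parallel processing between processors ‣ Same delay even when the system is extended | ‣ Difficult to implement ‣ Relatively unfavorable in size and low-power terms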

Interconnection network

Omega interconnection Octagon interconnection

Mesh interconnection

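Advantages / Disadvantages

Shared bus
‣ Relatively easy to implement
‣ Effective when there are few masters and slaves
‣ Same delay even when the system is extended
‣ Parallel processing between processors is difficult
‣ Low bus efficiency
‣ High power consumption (broadcasting)
‣ Implementation complexity: low

Crossbar
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: medium
‣ Implementation complexity (routing and scheduling): medium
‣ Size and wiring grow as the system is extended

Omega network
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: medium
‣ Data path: excellent (short)
‣ Implementation complexity (routing and scheduling): high
‣ Scalability is somewhat limited

Octagon
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: medium
‣ Data path: excellent (shortest)
‣ Implementation complexity (routing and scheduling): high
‣ Scalability is somewhat limited (the number of masters and slaves is limited to 8)

Mesh
‣ Parallel processing between processors is possible
‣ Scalability and flexibility: excellent
‣ Data path: medium
‣ Implementation complexity (routing and scheduling): very high
‣ Suitable for medium and large systems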

Interconnection network

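Network router cell

– Multiprocessor platform with 4 masters and 6 slaves

– The CI cell is designed as a structure of 24 2-by-2 muxes

– CI Controller => Req, Grant, mux control etc.

– With Seamless CVE and ModelSim co-simulating, the ARM926 and the Teak DSP access slaves simultaneously and each reads and writes its own data

[Figure: Platform function block]

– When a master accesses a slave (IP), the CI controller internally handles the request signals, grant signals, mux control signals, round-robin arbitration, and address decoding

[Figure: CI controller inner block]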
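The round-robin function of the CI controller can be sketched as below for a single slave port with 4 masters. The class name and the rotate-after-grant policy are assumptions; the slides do not give the controller's exact arbitration rules.

```python
class RoundRobinArbiter:
    """Grant one requesting master per call, rotating priority after a grant."""

    def __init__(self, n_masters=4):
        self.n = n_masters
        self.last = self.n - 1   # so that master 0 has first priority

    def grant(self, requests):
        """`requests` is a list of n booleans; return granted index or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None


if __name__ == "__main__":
    arb = RoundRobinArbiter(4)
    reqs = [True, False, True, True]
    print([arb.grant(reqs) for _ in range(4)])
```

Rotating the starting point after each grant keeps a continuously requesting master from starving the others.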

CI-controller State Diagram

CI controller simulation waveform

DVB-T Baseband Receiver

Hardware-software co-design flows

A shared memory structure and hardware-software partitioning

Frequency offset compensator hardware

Fine and Coarse Frequency Synchronizer (Beek & Classen)

FFT block diagram

Equalizer hardware block diagrams

DVB-T baseband Receiver Scheduling

Performance evaluation

Processing Types / Functional Block | SW | SW & HW (Teaks + ARM + HW IP) | HW (IP) only | MAL
Frequency compensator & Remove Guard | - | 182.5 us | 13.8 us | 10.5 us
Fine Freq. Sync. (Beek) | - | 56.3 us | 1.5 us | 7.8 us
Symbol Timing Recovery | 144 us | - | - | 5.2 us
FFT | - | 188.9 us | 38.6 us | 13.6 us
Coarse Freq. Sync. (Classen) | - | 241 us | 3.3 us | 11 us
Scattered Pilot Detection | 46.5 us | - | - | 3.3 us
Equalizer | - | 219.5 us | 11.2 us | 9.5 us
De-mapping | 19.9 us | - | - | 4.9 us

Task Chart of Multi-processor platform for DVB-T baseband receiver

Modeling of Motion Compensation IP using SCML

Le Minh Nghia & Jun Dong Cho

Mobile SoC Design Automation Lab.

Sungkyunkwan Univ.

Introduction of SystemC Modeling in CoWare

• TLM Peripheral Modeling with CoWare
• SystemC Modeling Library (SCML)
• Motion Compensation Modeling using SCML

TLM Peripheral Modeling with CoWare

• Four use-cases for Transaction-Level Modeling (TLM)
– Functional View (FV)
– Architecture's View (AV)
– Programmer's View (PV)
– Verification View (VV)

• General pattern for modeling a peripheral component
– Separate behavior, communication, and timing
– Platform-dependent initiators and targets are created by the user
– A bus transactor converts a generic communication into a bus-specific TLM interface
– Timing accuracy depends on the use-case

• Communication
– Communication through function calls
– Simulation speed strongly depends on the bus model
– The PV bus model used for software development can be simulated very fast

• Behavior
– Functionality
– Synchronization
– Storage

• Timing
– The timing model is based on the clock object in the SystemC Modeling Library (SCML)

• Modeling the Target pattern
– Communication: bus transactor
– Storage and synchronization: register bank as interface
– Behavior: collection of call-back functions, each corresponding to a bitfield or register in the register bank

• Modeling the Initiator pattern
– Communication: the bus transactor converts posted transactions in the queue into real bus transactions
– Storage and synchronization: include a post port and an initiator storage element, scml_array (in SCML). The post port posts transactions in a non-blocking manner. The actual synchronization depends on the data and space available in the storage element backed by the scml_array object
– Behavior: modeled by autonomous SystemC processes
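The Target pattern (register bank plus per-register call-backs) can be illustrated with a language-neutral sketch. This is plain Python, not the SCML API: `RegisterBank`, `TargetPeripheral`, and the register index are hypothetical names chosen for the example.

```python
class RegisterBank:
    """Word-addressed registers with optional per-register write call-backs."""

    def __init__(self, size):
        self.regs = [0] * size
        self.on_write = {}

    def register_write_callback(self, index, callback):
        self.on_write[index] = callback

    def write(self, index, value):
        """Called by the bus transactor when the master writes a register."""
        self.regs[index] = value
        if index in self.on_write:
            self.on_write[index](value)

    def read(self, index):
        return self.regs[index]


class TargetPeripheral:
    START_STOP = 0   # hypothetical register index

    def __init__(self):
        self.bank = RegisterBank(4)
        self.running = False
        self.bank.register_write_callback(self.START_STOP, self.on_start_stop)

    def on_start_stop(self, value):
        """Behavior attached to the start/stop register."""
        self.running = bool(value & 1)
```

Storage lives in the register bank, while behavior lives only in the call-backs, which mirrors the separation of behavior, communication, and storage described above.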

Initiator Synchronization

• Two classes of initiator blocks:
– Free-running initiator: the transfers initiated by the block do not need any access from another peripheral
– Slaved initiator: the block has a target port, and transfers are initiated only when that port is accessed

• Three synchronization patterns for an initiator block:
– Free-running Initiator
– Fully Slaved Initiator
– Semi-free-running Initiator

• Modeling a Free-Running Initiator Peripheral
– The thread is modeled by an SC_THREAD and posts transactions
– wait(sc_time): schedules the next execution of the thread

• Modeling a Fully-Slaved Initiator Peripheral
– A slaved initiator only sends a transaction when its target port is accessed
– The loop in a fully slaved initiator returns control to the master thread after it has posted a transaction

• Modeling a Semi-Free-Running-Slaved Initiator Peripheral
– The thread containing the loop is triggered by a start event
– The start event is generated by accessing the target port of the initiator
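The semi-free-running slaved pattern can be sketched with a Python generator standing in for an SC_THREAD. This is an analogy, not SystemC code: the `"wait_start"` and `"wait_time"` markers, the tiny scheduler `run`, and the transaction names are all assumptions.

```python
def initiator_thread(n_transfers, posted):
    """Generator standing in for an SC_THREAD.

    `yield "wait_start"` models waiting on the start event; `yield "wait_time"`
    models wait(sc_time) between posted transactions.
    """
    while True:
        yield "wait_start"                # block until the target port is written
        for i in range(n_transfers):
            posted.append(f"txn{i}")      # non-blocking post
            yield "wait_time"             # schedule the next execution


def run(thread, start_writes, max_steps=100):
    """Tiny scheduler: resume the thread, firing the start event on demand."""
    pending_starts = start_writes
    for _ in range(max_steps):
        state = next(thread)
        if state == "wait_start":
            if pending_starts == 0:
                return                    # no start event: the thread stays idle
            pending_starts -= 1


if __name__ == "__main__":
    log = []
    run(initiator_thread(3, log), start_writes=1)
    print(log)
```

Once started, the loop runs freely on its own timing, but it cannot begin without an access to the target port, which is what distinguishes this pattern from the free-running and fully slaved ones.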

SystemC Modeling Library (SCML)

• Memory and bitfield objects:
– Model bitfields and memory-mapped registers
– Memory objects support posting non-blocking transactions
– Support synchronization by reading and writing data through blocking accesses

• Clock object
– Models timing or clocks in IPs

• Initiator-side object
– Models the communication of initiator peripherals to support reuse.

Modeling TLM Motion Compensation

• Outline features of the Motion Compensation IP
– Synchronization: Semi-Free-Running-Slaved Initiator
– Behavior: algorithm extracted from the J.M source code
– Structure includes two parts
  • Target part: interfaces with the master processor through a register bank, modeled following the Target pattern of SCML
  • Initiator part: models the posting of transactions and the synchronization of their transmission, following the Initiator pattern of SCML

• Three ports:
– pConfig: interface with the master processor to receive parameters
– p_Irq: generates an interrupt to synchronize with the master processor
– p_Post: posts transactions to a specific bus through the bus transactor

• Register bank: for the parameters of the Motion Compensation block

• StartStopReg and IrqReg: for the interface with the master processor

• Behavior block: for the Motion Compensation algorithm and transaction modeling

• Call-back functions: for events caused by writing StartStopReg and IrqReg

• Functions in the TLM Motion Compensation model:
– f_initialize(): initializes the model parameters
– f_thread(): waits for the event generated by writing to StartStopReg
– f_write_start_stop(): call-back function for writes to StartStopReg. It activates or deactivates the model by generating an sc_event to signal f_thread().
– f_clear_irq(): clears IrqReg
– f_MotionCompensation(): Motion Compensation behavior based on the original source code in the J.M reference software
– f_do_post(): posts the transactions held in storage (the transaction pool) to the bus transactor and manages posting synchronization
– f_postTransfer(): posts a transaction to the bus transactor
– f_release_trans(): releases the transaction pool

Next…

• Extract parameters as TestVector from J.M source code

• Build a platform in CoWare

• Test Motion Compensation IP with TestVector

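Conclusion

• As the complexity and cost of (mobile) SoCs increase, a design process based on an MP-SoC platform becomes important

• Challenges for mobile platforms: low power, verification including the RF interface, variety of standards, and platform optimization

• It is desirable to develop a platform that combines the strengths of various platforms and methodologies

• Cultivate talent (system architects) who can understand and design across HW, SW, and algorithms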
