63
學界開發產業技術計畫 經濟部 Compilers and System Software for Embedded DSP Processors 畫:自940201日至970131日止 國立清華大學積體電路設計技術研發中心 國立交通大學電子資訊研發中心 李政崑教授

Compilers and System Software for Embedded DSP …mdker/courses/Seminar2008/Week03.pdf經濟部 學界開發產業技術計畫 Compilers and System Software for Embedded DSP Processors

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

  • 學界開發產業技術計畫經濟部

    Compilers and System Software for Embedded DSP Processors

    全 程 計 畫:自94年02月01日至97年01月31日止國立清華大學積體電路設計技術研發中心

    國立交通大學電子資訊研發中心

    李政崑教授

  • 2

    嵌入式系統軟體商機

    Marketing, Design, Library Control, EDA Tools and

    Silicon

    Marketing, Design, Library Control,

    EDA Tools

    Marketing, Design

    Marketing

    9.4%10,737

    74.7%85,726

    8.4%9,693

    7.5%8,644

    Late Adopter

    9.8%18,918

    45.0%86,868

    31.0%59,843

    14.2%27,412

    Late Adopter

    Upper Mainstream

    Lower Mainstream

    Power Users (Seats)

    Upper Mainstream

    Lower Mainstream

    Power Users (Seats)

    Marketing, Design, Library Control, EDA Tools and Silicon and

    SoftwareMarketing, Design,

    Library Control, EDA Tools and

    Silicon

    Marketing, Design, Library Control,

    EDA Tools

    Marketing, Design

    Marketing

    Semiconductor Design Pyramid, 2002(左圖), 2005(右圖) Source: Forecast: Embedded Software Tools, Worldwide 2004-2009, Gartner (Aug. 2005)

  • Why will software be a concern tohardware people?

  • Case Study 1: Low Power Decisions Are Made Very Early in the Design FlowCase Study 1: Low Power Decisions Are Made Very Early Case Study 1: Low Power Decisions Are Made Very Early in the Design Flowin the Design Flow

    Power Reduction Percentage

    Production

    Place & RouteClock-tree, gate-level

    RTL SynthesisClock gating, μArch, multi-Vt

    ArchitectureMulti-Voltage islands, sleep mode, etc.

    System DesignAlgorithms, IP, S/W, etc..

    Design

    Implementation

    Source: Cadence Encounter Low-Power Design Flow Seminar and Workshop

  • Case Study 2:Audio DSP Addressing

    Case Study 2:Case Study 2:Audio DSP Audio DSP AddressingAddressing

    I registers• 14-bit addressing• Contain the actual address used to access memoryM registers• 14-bit addressing• Used for post-modify schemeL registers• 14-bit addressing• Provide the modulus logic with the length information

    Memory Addressing Space• 14-bit(16k) addressing

    24-bit(16M)addressing

    • I, M, L registers NOchanges (all 14-bit size)

    • Apply two “page”registers for addressing

    DMPAGE – 10-bitPMPAGE – 10-bit

    We DO NOT know when does the carry happen even if we provide DMPAGE and PMPAGE registers for 10-bit extra addressing

    DMPAGE I,M,L register01323

  • Better Solution for Memory ExtensionBetter Solution for Better Solution for Memory ExtensionMemory Extension

    Better solution 1. Hardware automatically complete carry

    calculation for “page” registers2. No “page” registers. Instead, extend I,M,L

    registers to 24-bit or design I’, M’, L’ to be 24-bit

  • Case Study 3:Saturation Instruction

    Case Study 3:Case Study 3:Saturation InstructionSaturation Instruction

    Instruction 1, Saturation-Yes-NoInstruction 2, Saturation-Yes-NoInstruction 3, Saturation-Yes-NoInstruction 4, Saturation-Yes-No

    Saturation-onInstruction 1Instruction 2Instruction 3Instruction 4Saturation-off

  • Code-1Saturation-onInstruction 1Instruction 2Instruction 3Code-2Instruction 4Saturation-off

    Compiler Intrinsic FunctionCompiler Intrinsic FunctionCompiler Intrinsic Function

    C level functions with several assembly instructions and expanded into compiler intermediate instruction. L_SHR

    L_SHL

    L_SUB

    L_ADD

    L_MSU

    L_MAC

    ROUND

    L_MULT

    SATURE

    Code-1Saturation-onInstruction 1Instruction 2Instruction 3Instruction 4Saturation-offCode-2

    Code-1SatureCode-2

  • ESWESW重點技術項目重點技術項目平台技術與嵌入式軟

    創新應用系統及軟體

    (數位生活)數位家庭應用軟體

    娛樂、資訊、控制、服務等應用。

    個人可攜式消費性電子產品

    通訊、娛樂、資訊等行動裝置應用。

    不限定國內外嵌入式系統

    AP1

    AP2

    AP3

    AP4

    AP5

    AP6

    AP7

    TWMPUs

    DSP,Multi-DSP

    TW Platform

    Compiler RT OS

    Embedded Software

  • Star IP ProgramPerspective: To coordinate resources from the industry, universities, and research organizations in proactive development of architecture and integrated software, building upTaiwan’s industrial leadership in Silicon IP

    Basic concepts:

    CPU/DSP/MPU

    Innovative IPs for multi-band and digitalization are required for future wireless/mobile products

    Multimedia IPs

    NetworkingInterconnection/Bus/Memories core

  • StarIP Programs

    Key Techniques in Chip Systems

    Transmission Links in Chip Systems Ultra Low Power

    (ULP) DSP Core

    Messenger – Distributed Radio Transmitter System

    Low-Power Dual-Processor System

    A-Core

    SunplusS-Core

    STCPAC

  • ESW 系統及應用參考平台

    Multi-Core Micro Kernel

    GSM, 3G, WiMAX

    Multi-Core,MPU + VLIW DSP

    Ethernet, 1394

    DMA / Memory Control

    Multi-Core interconnection

    USB, HDMI

    ESL: Virtual Platform Design

    Multimedia Accelerator

    Applications

    Multi-core SoCPlatform

    System Kernel Software

    Multi-Core IPC

    Power Mgmt

    DSP Middleware

    Streaming Models

    DirectFB, OpenGL ES

    Network protocol, SIP, RTP

    MiddlewareGUI, 3D, JSR-184/239DLNAJVM, OSGi

    Offi

    ce, P

    DF

    Med

    ia

    Play

    er

    Imag

    e Vi

    ewer

    Secu

    rity

    SW

    DR

    M

    GPS

    Nav

    .

    Mai

    l

    Web

    B

    row

    ser

    VOIP

    , IM

    Java

    AP,

    Jav

    a 3D

    P2P Stream Server

    Web Service

    Web Container

    Multimedia Portal

    Digital Camera

    Wireless

    Transcoding Biometrics Crypto Engine

    Graphics AV CODECs Baseband

    Embedded Linux + RTOS

    Multi-Core IDE and Debugger

    Multi-Core Compiler Toolkits

    APIs

  • 學界開發產業技術計畫經濟部

    計畫架構

    前瞻高效能低耗能之雙處理器系統技術研發計畫

    權重: 100%

    A. 系統軟體研發與設計(李政崑教授,清大資工)

    C.作業系統與應用程式之研發(林大衛教授,交大電子)

    B.平行訊號處理器晶片開發與電路設計(黃威教授,交大電子)

    (人力、經費)第一年度 29.20% 33.26%第二年度 29.20% 33.26%第三年度 29.20% 33.26%

    (人力、經費)第一年度 37.37% 33.17%第二年度 37.37% 33.17%第三年度 37.37% 33.17%

    (人力、經費)第一年度 26.78% 16.42%第二年度 26.78% 16.42%第三年度 26.78% 16.42%

    1.超長指令集數位訊號處理器編譯器之設計與研發(李政崑教授,清大資工)(黃元欣教授,海大資工)

    2. 數位訊號處理器相關函式庫之研發及系統效能評估(石維寬教授,清大資工)(李政崑教授,清大資工)

    1. 物件導向執行環境(陳俊穎教授,交大資科)

    2. Low Power Logic and Memory Units(黃威教授,交大電子)

    1. Extensible and Energy-Aware Micro-architecture(劉志尉教授,交大電子)

    3. Interface Circuit Design and Bit- CoprocessingAccelerator(張添烜教授,交大電子)

    2. 多媒體演算法之系統分析及設計研發(林大衛教授,交大電子)(賴尚宏教授,清大資工)

    4. 雙處理器開發工具整合與測試之系統規劃(李政崑教授, 清大資工)

    3. 雙處理器系統展示平台之開發與規劃(分包工研院STC)

    (人力、經費)第一年度 6.66% 17.2%第二年度 6.66% 17.2%第三年度 6.66% 17.2%

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    94/2 5 8 11 95/2 5 8 11 96/2 5 8 11

    全程計畫產出時程表全程計畫產出時程表計畫產出

    Debugger(2.0)

    Compiler(3.0)

    PACDSP 2.0 PACDSP 3.0

    Debugger(3.0)

    Microkernel(3.0)

    Assembler/linker(2.1)

    Assembler/linker(3.0)SIMD/sub-word/GRA

    NewLib

    Dual-core Programming model

    Compiler Optimizations

    Microkernel

    Compiler Toolchain

    Multi-mode On-demand SRAMAsynchronous FIFO in GALS

    Microkernel(2.0)

    Embedded DRAM

    Architecture

    Low-Power

    Pica 2.0 Pica 3.0 Multi-mode SARM for Virtual Cluster DSP

    Multicore DSP Interface

    KVM on PAC

    GALS-based FFTMultiple Supply Voltage System

    Frame-based MPEG4 decoder

    JVM with JIT compiler on PAC

    Efficient Intra-Prediction

    Fast-Intermode Decision(static Learning)

    Simple ME Object-based MPEG4 decoderFast Intra-Prediction

    Intra-Prediction on PACFast Motion Estimation

    Java Environment

    Video CodingObject-based MPEG4 encoder

    Frame-based MPEG4 encoder

    Multi-LevelPower Management

    Low Power TCAM(0.26fj/bit/Search) Controllable Dual-PortLow Power SRAM

    DFVM for Energy-Aware FFT

    Streaming Programming model

    GRA/SIMD Compilers

    NewLib on PMP SoC

    2

    ESL Platform

    Toolkits for CIC

    Controllable SRAMfor Virtual Cluster DSP

    Multi-threaded DSP Interface

    CVM/KVM/MIDP 2.0 BenchmarkingAOT Compilation Prototype

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發 15

    2007.3Enable PAC DSP System Newlib

    2007.3Enable PAC DSP System Newlib

    PAC3.0

    PAC3.0

    PAC2.1

    PAC2.12005.11

    PAC DSP 2.1 Assembler/Linker Available2005.11

    PAC DSP 2.1 Assembler/Linker Available 2005.12PAC DSP 3.0 Document Available2005.12PAC DSP 3.0 Document Available

    2005.9PAC DSP 2.1 Document Available2005.9PAC DSP 2.1 Document Available

    2005.9PAC DSP 2.0 Debugger Available

    2005.9PAC DSP 2.0 Debugger Available

    2005.10PAC DSP 2.0 Microkernel Available

    2005.10PAC DSP 2.0 Microkernel Available

    PAC2.0

    PAC2.0

    PAC3.1

    PAC3.1

    2006.2 PAC DSP 3.0 ISS Available2006.2 PAC DSP 3.0 ISS Available2006.3PAC DSP 2.0 Real Chip Demo2006.3PAC DSP 2.0 Real Chip Demo2006.4

    PAC DSP 3.0 Compiler Available2006.4

    PAC DSP 3.0 Compiler Available

    2006.5PAC DSP 3.0 RTL Freeze / Tape-out2006.5PAC DSP 3.0 RTL Freeze / Tape-out

    2006.7PAC DSP 3.0 Microkernel Available

    2006.7PAC DSP 3.0 Microkernel Available

    2006.10PAC DSP 3.0 Real Chip Testing Passed2006.10PAC DSP 3.0 Real Chip Testing Passed

    2006.2PAC DSP 3.0 Assembler/Linker Available

    2006.2PAC DSP 3.0 Assembler/Linker Available

    2006.3PAC DSP 3.0 Debugger Available

    2006.3PAC DSP 3.0 Debugger Available

    2006.5PAC DSP 3.0 Software/Hardware Testing

    Criteria Passed

    2006.5PAC DSP 3.0 Software/Hardware Testing

    Criteria Passed

    2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available

    2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available

    2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed

    2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed

    2007.4SIMD with Sub-word Instructions/Global RFA

    2007.4SIMD with Sub-word Instructions/Global RFA

    2007.6Programming Model Design for PAC II

    2007.6Programming Model Design for PAC II

    Tool Chain: Compiler & MicroTool Chain: Compiler & Micro--KernelKernel

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發 16

    2007.3Enable PAC DSP System Newlib

    2007.3Enable PAC DSP System Newlib

    PAC3.1

    PAC3.1

    2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available

    2007.1PAC DSP 3.0/3.1 Optimizing Compiler Available

    2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed

    2007.1PAC PMP SOC Tape-outARM+DSP H.264 VGA Realtime Demo Passed

    2007.4SIMD with Sub-word Instructions/Global RFA

    2007.4SIMD with Sub-word Instructions/Global RFA

    2007.6Streaming RPC Programming Model Design for PAC II

    2007.6Streaming RPC Programming Model Design for PAC II

    Tool Chain: Compiler & MicroTool Chain: Compiler & Micro--Kernel Kernel (Continue(Continue--2008)2008)

    2008. 01PAC DSP Toolkits Delivered for CIC

    2008. 01PAC DSP Toolkits Delivered for CIC

    2007.11PAC DSP Debugger on PAC PMP SoC Board

    2007.11PAC DSP Debugger on PAC PMP SoC Board

    2007.10PAC PMP SoC Board Available2007.10PAC PMP SoC Board Available

    2007.9Pass EEMBC Telecom Test-suite

    2007.9Pass EEMBC Telecom Test-suite

    2007.10Pass Major MiBench Test-suite

    2007.10Pass Major MiBench Test-suite

    2008. 02Enable PAC DSP newlib support PAC PMP SoC Board

    2008. 02Enable PAC DSP newlib support PAC PMP SoC Board

    2008.3PACDSP+AndeSCore Dual-Core ESL platform

    2008.3PACDSP+AndeSCore Dual-Core ESL platform

    2007.8Communication APIs

    2007.8Communication APIs

    2008. 03Multi-core communication API runtime

    2008. 03Multi-core communication API runtime

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Development Software

    Middleware

    Dual-Core OS/Architecture/ESL

    科專產出科專產出: : 整合多媒體雙核心應用平台整合多媒體雙核心應用平台

    AV.CodecsJava VM

    pCoreLinux

    RPC

    Streaming PRC

    DSP Lib.

    OpenMAX IL

    ESL: Virtual PlatformMPU+PACDSP

    OpenMAX DLCompiler Toolkits

    ApplicationGUI

    SIP Phone Image ViewerMedia PlayerJava AP

    Web Browser

    透過網路達成VoIP的電話功能

    提供PACDSP H.264及MPEG4 Codecs

    提供網路瀏覽應用

    相片瀏覽功能

    影像播放功能

    提供雙核心彼此間的呼叫與資料傳遞

    提供NewLib,作為C語言函式庫

    提供完整的軟體開發環境(IDE, Compiler,

    Debugger, Assembler, Binutilus, …)

    提供Linux 2.6的作業系統環境

    雙核心高效能硬體平台(A3, STC技術支援)

    系統層級的模擬環境(衍生計畫產出)

    提供OpenMax IL層的多媒體呼叫,以及

    OpenMax DL層的函式庫設計

    可執行Java應用程式,提供Java VM

    提供Streaming方式的雙核心RPC資料傳遞

    藉由pCore(DSP微核心),提供多重函式呼

    叫與工作切換

  • PACDSP Kernel

    From ITRI STC

  • Clustered Architecture

    Scalable Clustered ArchitectureScalable computation power fits the different application requirement

    Distributed Register FileDistributed register file reduce the area and power consumption

    Cluster From ITRI STC

  • Distributed Register File- A, PP,C and AC Register File

    Ping-pong Register FileReduce Move OperationReduce the port number in each register file

    Constant Register FileReduce Load/Store OperationReduce the power consumption in coefficient access

    From ITRI STC

  • Clustered Data Path and Distributed Register File- Evaluation Result

    Compare with Centralized Register File

    Area : 76.8% area are saved Access Time : 46.9% access time are saved

    From ITRI STC

  • DSP Compilers

    • Heterogeneous Registers• Addressing Modes• Limited Connectivities• Local Memories• Harvard Architecture• Sub-word/SIMD Performances

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Key Compiler TechnologiesVLIW DSP compilers for distributed Register FilesPALF scheduling policies for ILP (CPC 2006)SA-based scheme for ILP (LCPC 2005)GRA scheme for distributed register files (CPC 2007)Copy propagations for distributed register architectures (LCPC 2006)SWP flow for distributed register file architectures (ACM LCTES 2007)SIMD compiler optimizationsEnable compiler + intrinsics/extrinsicsCompiler for low-power schemes (ACM EMSOFT 2005/ACM TODAES 2007)

  • Irregular Register File Usage

    Local RFAccessible Only by Dedicated FU

    Global RFShared by M-I PairAccess Limited byPing-pong Switches

    Constant RFWritable only by M-Unit1 Read-only Access Per UnitOnly Usable by Some Instructions

    M-Unit

    A Registers

    M-Unit

    A Registers

    I-Unit

    AC Registers

    I-Unit

    AC Registers

    B-Unit

    R Registers

    M-Unit M-Unit

    I-Unit I-Unit

    D Registers D Registers

    M-Unit M-Unit

    I-Unit I-Unit

    CRegisters

    CRegisters

    Maximal 2 Read Ports + 1 Write PortLow cost and low power

    (Compared to TI C6: 10 Read Ports + 6 Write Ports)

  • 1. Register Allocation for PAC DSP

    How to adapt register allocation to the features of irregular register files?

    Various RFs for each FUPing-pong Bank SwitchesComplex Register File Communication

    Our proposed solutions:SA-RA: A simulated-annealing approach (LCPC 2005)

    Random based nondeterministic iterative searchPALF-RA: A more direct and faster heuristic approach

    PALF Register Allocation Scheme (CPC 2006)

  • Solution 1:Simulated-Annealing (SA) Based Register Allocation Approach

    Motivation:Complex interference from:

    We appreciate a machine-learning method to give a near-optimal results.To be a base reference for developing a faster heuristic method!

    Register Allocation

    InstructionScheduling

    Code Insertionfor Distributed Register

    Communication

  • To Determine: Virtual Register Register File (Bank)

    Input: un-scheduled instructionsOutput: a schedule of the instructions

    a register file assignment (RFA) map

    RFA map = {(v1, f1), (v2, f2), ...} Where vi : a virtual register, fi : a register file (bank)

    • Setup SA:1. An initial random RFA map2. sched_len = PAC_Scheduler ( initial RFA map )3. SA control variables:

    • threshold• p_test: a probability test value (0 < p_test < 1).• energy: initial value > threshold.

    PAC_Scheduler:Graph-coloring based register allocation according to the RFA mapInstruction scheduling and code insertion for register file communication

  • To Optimize: Scheduling Result

    Randomly change:a mapping (vi, fi)

    Re-run:new_schedule_len =

    PAC_Scheduler (new RFA map)

    new RFA map

    Better result test:new_schedule_len < schedule_len

    energy--schedule_len =

    new_schedule_len

    Random test:a random number > p_test

    energy++

    yes

    yes

    no

    no

    new RF

    A map

    old RFA map

    SA stop test:energy > threshold

    yes

    FinalRFA map

    &schedule

    no

  • Solution 2:Concepts for PALF Register Allocation Scheme

    Which RF to be allocated?

    How to utilize ping-pong banks?

    How to utilize clusters?

    LessUsability

    More Scheduling Interference

    LocalRF

    ConstantRF

    GlobalRF

    HigherPriority

    To maximize parallelism between banks! Hard to determine without scheduling

    If the results are not worse than using global RF,why not use local RF instead?

    M-Unit I-UnitOnly allocate Global RF for:

    To partition operations into two clusters if it worth!

  • PALF Register Allocation Scheme

    MaximalLocalization

    Register FileAssignment

    Ping-pongRegister BankAssignment

    ClusterAssignment

    CommunicationCode Insertion

    Post-passRegister

    Allocation

    BuildCRTA-DDG

    2-Cluster Code?Yes

    No

    Ping-pong Aware Local Favorable (PALF)To allocate local RF firstTo assign ping-pong banks for minimizing interference

    PALF determines RF allocation, then applies usual register allocation to each RF

  • Experiment Platform

    EnvironmentPACDSP Compiler (using ORC infrastructure)PACDSP binutils(modified GNU binutils)Instruction Set Simulator

    Cycle accurate

    BenchmarkDSP-stone

    System-SoftwareDevelopment Suite

    Profiler

    Debugger

    Libraries

    Assembler Linker

    C/C++ Compiler

    InstructionSet

    Simulator

  • DSPstone Testing Patterns -Speedup

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    2. Enable GRA (Global Register File Assignment) 2. Enable GRA (Global Register File Assignment) for PAC Architectures for PAC Architectures

    •• Extends Local (single block) RFA to Global Extends Local (single block) RFA to Global RFARFA

    •• Block prioritization for Global RFABlock prioritization for Global RFA–– Based on loop depth, scheduling length, Based on loop depth, scheduling length,

    frequency (static or profiled), etc.frequency (static or profiled), etc.–– Prioritizing more Prioritizing more ““importantimportant”” blocks that have blocks that have

    greater effects on performancegreater effects on performance–– Try to minimize modification to the Try to minimize modification to the RFAsRFAs of the of the

    blocks with higher priority. That is, keep them blocks with higher priority. That is, keep them untouched as possibleuntouched as possible

    •• A LocalA Local--Conscious Global Register Conscious Global Register AllocatorAllocator for VLIW for VLIW DSP Processors with Distributed Register Files, DSP Processors with Distributed Register Files, ChiaChia Han Han Lu, YoungLu, Young--JiaJia Lin, YiLin, Yi--Ping You, Ping You, JenqJenq--KuenKuen Lee, CPC Lee, CPC 2007 (Compilers for Parallel Computing 2007), Lisbon, 2007 (Compilers for Parallel Computing 2007), Lisbon, Portugal, July 9Portugal, July 9--11, 200711, 2007

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Example of Global RFAExample of Global RFA

    BB1

    BB2

    BB4

    BB3

    BB5 BB6

    BB7

    BB8

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Experimental Result of Global RFAExperimental Result of Global RFA

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    PACDSP SubPACDSP Sub--word Instructionsword Instructions

    • PAC DSP Sub-word instructions– Three execution modes

    • Single: ADD Rd, Rs1, Rs2 (one 32-bit)• Dual: ADD.D Rd, Rs1, Rs2 (two 16-bit)• Quad: ADD.Q Rd, Rs1, Rs2 (four 8-bit)

    – Sub-word calculation (3 modes)

    32 bit32 bit

    32 bit32 bit

    32 bit32 bit

    +

    =

    16 bit16 bit+

    =

    16 bit16 bit

    16 bit16 bit 16 bit16 bit

    16 bit16 bit 16 bit16 bit

    88 88 88 88++

    =88 88 88 88

    + + +

    = = = =

    88 88 88 88Single Dual Quad

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Performance for SIMD on PAC CompilerPerformance for SIMD on PAC Compiler

    0

    0.5

    1

    1.5

    2

    2.5

    bkfir lms ssfir vecadd vecdot

    Spe

    edup

    O0+ILO0+SIMDO1+ILO1+SIMDO2+ILO2+SIMD

    0

    2

    4

    6

    8

    10

    12

    14

    16

    18

    bkfir lms ssfir vecadd vecdot

    Spee

    dup O0

    O1+ILO2+ILO2+SIMD

    PS. IL means IPA and LNO optimization

    Fig. Improvem

    ent for each optim

    ization levelFig. O

    verall Performance

    Dspstone performance

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Compiler Performance with G.723.1a Application

    G.723.1a Decoder

    0

    2

    4

    6

    8

    10

    12

    TI O0 TI O1 TI O2 TI O3 PACC O0 PACC O1 PACC O2 PACCO2+IPA+WOPT

    PACC withExtrinsic

    Cyc

    le N

    orm

    aliz

    ed to

    TI O

    3

    G.723.1a Encoder

    0

    2

    4

    6

    8

    10

    TI O0 TI O1 TI O2 TI O3 PACC O0 PACC O1 PACC O2 PACCO2+IPA+WOPT

    PACC withExtrinsic

    Cyc

    le N

    orm

    aliz

    ed to

    TI O

    3

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Compiler with SIMD Intrinsic/Extrinsic Support

    Extrinsic

    Register pair (64-bit)SWAP4E

    Fix-pointSWAP4

    SaturatedUNPACK4(U)

    Intrinsic supportedRORPACK4

    ROLUNPACK2(U)

    INSERTISWAP2

    L_SHRINSERTPERMH4

    L_SHLXDOTP2EXTRACTI(U)ADDC(U)PERMH2

    L_SUBSAA.QDOTP2EXTRACT(U)ABS.(D/Q)LIMB(U)CP

    L_ADDRNDXMSU.DSRAI.(D)NEGLIMHW(U)CP

    L_MSUBF.(D)XMAC.DSRA.(D)MERGESLIMW(U)CP

    L_MACSFRA.(

    D)XFMAC.(D)XMUL.DSRLI.(D)MERGEACOPY

    ROUNDLMBDMSU.DXFMUL.(D)SRL.(D)SUB(U).(D/Q)MOVI(U).H

    L_MULTCLSMAC.DMUL.DSLLI.(D)ADDI.(D)(D)MAX(U).(D/Q)MOVI(U)

    SATURE(D)CLRFMAC(uu/us/su).(D)FMUL(uu/su/us).(D)SLL.(D)ADD(U).(D/Q)(D)MIN(U).(D/Q)MOVI.L

    ExtrinsicSpecialMultiply & AddMultiplicationBit

    ManipulationArithmeticComparisonData

    Transfer

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    MiBenchMiBench Performance for PAC CompilerPerformance for PAC Compiler

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    EEMBC PerformanceEEMBC Performance

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    EEMBC PerformanceEEMBC Performance

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Compiler for Low-Power with Power Gating Instruction

    0%

    5%

    10%

    15%

    20%

    25%

    30%

    35%

    Code

    Siz

    e G

    rout

    h

    complex_m

    ultiply

    complex_u

    pdateconv

    olution

    dot_produc

    tfir2di

    m fir

    irr_biquad

    _N_sectio

    ns

    irr_biquad

    _one_sec

    tion lmsmatri

    x 1x3 matri

    x

    n_comple

    x_updates

    n_real_up

    datesreal_

    updateavera

    ge

    CADFE CADFE w/Sink-N-Hoist

    To turn off useless components in processors. Use compiler analysis techniques to analyze program behaviorsCompilers insert power-gating instructions and try to merge those instructions.

    ( )

    ( )

    ( )

    ( )

    ( ) ( )

    ( ) ( ) ( ) ( )

    ( ) ( )

    ( ) ( ) ( ) ( )

    in outp Pred b

    out loc in blk

    in outp Pred b

    out loc in blk

    b p

    b b b b

    b p

    b b b b

    =

    = ∪ −

    =

    = ∪ −

    U

    I

    COMPONENT COMPONENT

    COMPONENT COMPONENT COMPONENT COMPONENT

    SINKABLE SINKABLE

    SINKABLE SINKABLE SINKABLE SINKABLE

    HO

    ( )

    ( ){ }( )

    ( )

    ( )

    ( )

    ( ) ( )

    ( ) ( ) ( ) ( )

    ( ( ))( )

    , if ( ( ))

    in outs Succ b

    out loc in blk

    p Pred b outin

    p Pred b out

    out

    b s

    b b b b

    pb

    p

    =

    = ∪ −

    ⎧ Φ⎪=⎨∅ Φ =∞⎪⎩

    IISTABLE HOISTABLE

    HOISTABLE HOISTABLE HOISTABLE HOISTABLE

    GROUP OFFGROUP OFF

    GROUP OFF

    GROUP OFF

    --

    -

    -

    MIN

    MIN

    ( )( ){ }

    ( )( )

    ( )

    ( )

    ( ) ( ) ( ) ( )

    ( ( ))( )

    , if ( ( ))

    ( ) ( ) ( ) ( )

    loc in blk

    p Pred b outin

    p Pred b out

    out loc in blk

    b b b b

    pb

    p

    b b b b

    = ∪ −

    ⎧ Φ⎪=⎨∅ Φ =∞⎪⎩

    = ∪ −

    GROUP OFF GROUP OFF GROUP OFF

    GROUP ONGROUP ON

    GROUP ON

    GROUP ON GROUP ON GROUP ON GROUP ON

    - - -

    --

    -

    - - - -

    MIN

    MIN

    Based on our work inACM TODAES 2006 & ACM TODAES 2007

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    DualDual--Core/MultiCore/Multi--Core Compiler Core Compiler with Language Supportwith Language Support

    Fully control of dedicated Fully control of dedicated processors(sayprocessors(say, PAC DSP), PAC DSP)

    Boot & initialize DSPBoot & initialize DSPBased on Based on pCorepCore, multi, multi--tasking tasking environmentenvironmentHighly efficient data exchange Highly efficient data exchange between ARM and DSPbetween ARM and DSP

    Three layers frameworkThree layers frameworkLayer 0: Communication Layer 0: Communication LibraryLibrary

    APIs for C programsAPIs for C programsConventional C library Conventional C library interfacesinterfaces

    Layer 1: Layer 1: RemotingRemoting/RPC/RPCMinor CMinor C--extension languageextension languageEndEnd--toto--end service end service

    Layer 2: Data flow control typeLayer 2: Data flow control type

    44

    • Software Architecture Design for Streaming Java RMI, C. C. Yang, Jenq Kuen Lee, et al, accepted, Science of Computer Programming.

    • Streaming Support for Java RMI in Distributed Environment, C. C. Yang, Jenq-Kuen Lee, et al, ACM Principles and Practices of Programming In Java (PPPJ 2006), 2006.

    • Efficient Switching Supports of Distributed .NET Remoting with Network Processors, Jenq-Kuen Lee, et al, ICPP 2005.

    • Support and optimization of Java RMI over Bluetooth environments, P. C. Wey, JenqKuen Lee, et al, Concurrency and Computation:Practice and Experience, 2005;17:967-989, Wiley, 2005.

    • Efficient Support of Java RMI over Heterogeneous Wireless Networks, Jenq-KuenLee, et al, ICC, Paris, June 2004.

    • Specification and Architecture Supports for Component Adaptations on Distributed Environments, Chung-Kai Chen, Jenq KuenLee, et al, IPDPS 2004.

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    RPCRPC

    Application

    Communication runtimeRemote services invocation

    Streaming servicesStreaming services

    Operating systems pCorepCore

    Architecture supported communication mechanisms

    client server

    VICVIC MailboxMailbox Shared Shared MemoryMemory DMADMA

    Communication runtime

    RegistryRegistryRegistry

    Communication runtime

    Model the communication as end-to-end services Server: the remote processClient: the task that invokes a remote process

    Remote invocation is like a local function invocationClient and server communicates through communicator The communication information is stored in a shared data structure called registry residentServices are off-loaded to DSPs by service processors

  • DDualual--Core Programming with Core Programming with Streaming RPCStreaming RPC

    Key componentsKey componentsStreaming line: associated to an RPC request, provides a Streaming line: associated to an RPC request, provides a communication channel between the sender and the receivercommunication channel between the sender and the receiverStreaming buffer: associated to a streaming line for providing dStreaming buffer: associated to a streaming line for providing data ata bufferingbufferingStream controller: monitoring and managing the streaming buffersStream controller: monitoring and managing the streaming buffersfor supporting datafor supporting data--driven and optimization features driven and optimization features

    Application

    Communication runtime

    Remote services invocationStreaming servicesStreaming services

    Operating systems

    pCorepCore

    Architecture supported communication mechanisms

    client server

    VICVIC MailboxMailbox Shared Shared MemoryMemory DMADMA

    Communication runtime

    Streaming lineStreaming lineStreaming buffer

    Stream ControllerStream Stream ControllerController

  • Demo Environment :PAC EVBARM Versatile Platform PB926

    Platform 64MB Intel NOR Flash128MB 32-bit SDRAMEthernetVGA monitor output2 AHB-m , 1 AHB-s bridge

    ARM 926EJ-S300M Hz0.18μ32KB cache

    SoftwareLinux 2.6.17IPC Kernel Module

    EVB (PAC DSP)Platform

    512MB SDRAM1MB ARAM *2 , 1MB block RAM *2Audio out, video out

    PAC DSP 3.0250 MHZ64KB data ram32KB instruction cache

    SoftwarepCore v3.0IPC library

    ARM VersatileARM Versatile

    EVBEVB

  • pCore - PACDSP™ 微核心Kernel Feature

    Static memory allocationpCore supports configurable kernel

    Task ManagementpCore supports at most 16 tasksPriority based schedulerConstant time scheduler

    Inter-Process Communication

    pCore supports at most 4 mailboxes

    SynchronizationpCore supports at most 4 semaphores

    Dual-Core ProgrammingpCore supports APIs to help programmer to communicate with MPU(ARM).

    2008/06/24 48

    -41Pipe write

    -41Pipe read

    -21Pipe initialization

    234284Mailbox pend (with context switching)

    -70Mailbox pend ( no context switching)

    218268Mailbox post (with context switching)

    -54Mailbox post( no context switching)

    220270Semaphore pend (with context switching)

    -56Semaphore pend ( no context switching)

    220270Semaphore post (with context switching)

    -56Semaphore post( no context switching)

    198249Task change priority

    205254Task suspend (with context switching)

    -91Task suspend ( no context switching)

    -113Task delete

    280321Task create (with context switching )

    -147Task create (no context switching )

    With MCSS*

    No MCSS

    Kernel service /cycle counts

    * K.-Y. Hsieh, Y.-C. Lin, and J. K. Lee, “Enhancing microkernel performance on VLIW DSP processors via multiset context switch,” Journal of VLSI Signal Processing Systems, 2007.

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    ApplicationsApplicationswith Streaming RPCwith Streaming RPC

    FeaturesFeaturesSupport RPCSupport RPC--level level abstractionabstractionDesign patterns for Design patterns for overlapping overlapping communication and communication and computationcomputationThreshold Threshold parameter for parameter for reducing amount of reducing amount of handhand--shakingshakingSupport Linux kernel Support Linux kernel 2.6.x 2.6.x Support two platformSupport two platform

    PACPAC--EVBEVBTI OMAP 5912TI OMAP 5912

    / *Streaming RPC server* /void imdct ( ){

    STREAM_ID id = 4 ;/* Initializing streaming channel* /stream_create( id ) ;/* Aggregating

    from streaming channel*/stream_get( id , DATA) ;steram_pop( id ) ;…}

    / *Streaming RPC server* /void imdct ( ){

    STREAM_ID id = 4 ;/* Initializing streaming channel* /stream_create( id ) ;/* Aggregating

    from streaming channel*/stream_get( id , DATA) ;steram_pop( id ) ;…}

    Example: Sample code of the streaming RPC implementation of an MP3 decoder

    /* Streaming RPC client*/void MP3 decoder ( ){stream_rpc( imdct , transmitter) ;

    }void transmitter( ){

    STREAM_ID id = 4 ;/* Initializing streaming channel* / stream_create( id ) ;/* Pushing data

    to streaming channel */stream_put( id , DATA) ;stream_push ( id ) ;…

    }

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Performance EvaluationPerformance EvaluationApplicationsApplications

    JPEG decoder, resolution JPEG decoder, resolution of 317*255of 317*255MP3 decoder, 128k bitMP3 decoder, 128k bit--rate, 44100 sample raterate, 44100 sample rateQCIF H.264 decoderQCIF H.264 decoder

    Streaming RPC improves Streaming RPC improves the average performance the average performance of the decoder by of the decoder by 30%30%

    Application kernelsApplication kernelsIDCT of JPEG decoderIDCT of JPEG decoderIMDCT of MP3 decoderIMDCT of MP3 decoderIT/IQ of H.264 decoderIT/IQ of H.264 decoder

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    Dual-Core Toolkits整合DemoDual-core Multimedia Mobile Phone

    Three multimedia decoder : JPEG, MP3, H.264One Simulated telecom decoding program (SIP)

    Dual-OS environment with communication libraryMultiple tasks on DSP scheduled by pCore with priority-based policyWhile user making a phone call:

    PAC DSP starts processing telecom decoding with highest priorityMedia decoding task is suspended and resumed at the end of phone calls

    media thread

    phone thread

    media task

    phone task

    idle taskprogram startinitialize tasks Start media make a phone call media contiunes

    AR

    MD

    SP

  • pllab.cs.nthu.edu.twpllab.cs.nthu.edu.twpllab.cs.nthu.edu.tw

    經濟部學界開發產業技術計畫

    雙核心嵌入式處理器架構技術虛擬平台-An Design Experience on

    AndESLive

    總計畫主持人 李政崑教授國立清華大學積體電路設計技術研發中心

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發Binutils

    Compiler

    GDB

    Libraries

    PACDSP + Andes Toolchain

    Andes+PAC:雙核心ESL模擬環境

    完成工作完成工作

    ESLESL模擬:模擬:完成完成PACDSP v3.0 PACDSP v3.0 sidsid IPIPPACDSP IPPACDSP IP與與AndESLiveAndESLive IPIP整合整合Share memoryShare memory及及interruptinterrupt溝通溝通

    開發工具:開發工具:

    PACDSP Compiler PACDSP Compiler PACDSP PACDSP BinutilsBinutils

    進行工作進行工作

    作業系統:作業系統:

    Dual core OS: Linux 2.6 Dual core OS: Linux 2.6 kernel(Andeskernel(Andes) and PACDSP ) and PACDSP pCore(NTHUpCore(NTHU) )

  • pllab.cs.nthu.edu.twpllab.cs.nthu.edu.tw

    AndesSCore™ + PACDSP Co-Simulation

  • pllab.cs.nthu.edu.twpllab.cs.nthu.edu.twDual Core – JPEG Demo

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    獲獎紀錄獲獎紀錄

    李政崑教授PACDSP Compiler教育部嵌入式軟體競賽優勝2008/7

    劉志尉教授The third place award for the Soft-IPUn-assign Topic in The Silicon Intellectual Property (SIP) Design Contest

    教育部 SIP競賽 不定題組 佳作2006

    項目 獲獎題目 指導老師

    2006/1 ASPDAC Outstanding Design Award 52mW 1200 MIPS Compact DSP for Multi-core Media SoC

    劉志尉教授

    2006/11 Workshop on Consumer Electronics and Signal Processing 學生論文獎

    Software implementation of MPEG-4 video decoder on the PACDSP platform

    林大衛教授

    2006/12 中華民國資訊學會碩博士最佳論文獎碩士論文獎佳作

    降低漏電電流暨增進效能之集合關聯式快取記憶體李嘉哲

    黃元欣教授

    2006/12 中華民國資訊學會碩博士最佳論文獎碩士論文獎佳作

    疊加網路上軟體元件遠端呼叫之資料串流支援楊智傑

    李政崑教授

    2007/7 旺宏金矽獎優勝 雙核心系統整合開發工具 李政崑教授

    2007/7 教育部嵌入式軟體競賽優勝 PACDSP BIOS之設計與驗證 李政崑教授

    2007/11 National Symposium on Telecommunications學生論文優等獎

    Software implementation of MPEG-4 object-based video decoder on a dual-core digital signal processor platform

    林大衛教授

    2007/12/22 微電腦應用系統設計製作競賽 - 嵌入式軟體系統類 研究所組 第三名

    Fast Motion Estimation in H.264 for PAC DSP 賴尚宏教授

    2008/1 中華民國資訊學會碩博士最佳論文獎博士論文獎優等

    低功率嵌入式處理器之編譯器最佳化研究游逸平

    李政崑教授

    2008/1 中華民國資訊學會碩博士最佳論文獎碩士論文獎優等

    支援數位訊號處理器之微核心設計及雙核心開發環境黃建今

    李政崑教授

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    CIC Distribution:PACDSP處理器開發工具整合套件

    與STC聯合提供國家晶片系統設計中心(CIC) PACDSP 處理器開發工具整合套件

    簽約日期:97.03.10簽約金額:2,000,000(15% income for NTHU MOEA Team)

    項目DSP tool chain on PMP EVBC-library for PMP EVBIntegrated development environmentPMP Spec /課程大綱Training course /LAB軟體使用環境說明書週邊硬體使用者手冊DSP programmingPMP EVB overview

  • 前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發前瞻高效能低耗能之雙處理器系統技術研發

    SunPlus SPCT6100

  • ESW平台開發機制

    Platform技術處

    Education教育部

    Research國科會

    Development工業局 基於台灣自有之處理

    器核心建構自有平台與軟體之核心技術

    培育自有平台與軟體之應用與研發人才

    推動平台與軟體商品化之多元應用

    透過開放軟體延伸平台之多元應用並研發性能提升之前瞻技術

  • 國科會自由軟體與嵌入式系統學術研發應用科技計畫

    計畫推動策略重視研發品質管理

    軟體品質管理流程研討、(需求分析、系統設計、系統測試)三階段文件評鑑、成果公開評鑑

    加強法人研究機構、公協會及產業界的參與及合作-黃金企鵝獎

    強化學術團隊與自由軟體社群的交流

    多元方式獎勵成果推廣與績優團隊

    提供訓練課程、技術支援與諮詢

    鼓勵使用嵌入式平台進行發展、研究以及嵌入式系統競賽之系統發展

  • 推動共用平台開發產業聯盟(TEIA)

    成立目的:推廣共用平台至「產、學、研」單位不定期舉行開發者技術分享研討會提供平台開發者交流網站

    成立步驟邀請共用平台廠商制定聯盟章程廣邀國內嵌入式軟硬體廠商加入

    規模負責聯盟事務

  • 推動共用平台開發者產業鏈(建立eco-system)

    系統整合開發

    嵌入式軟體

    嵌入式硬體IC設計

    IC設計服務SIP供應商

    嵌入式OS/軟體

    設計服務

    工業控制

    網路

    醫療

    消費性電子

    PC/PC周邊手機

    汽車

    國內知名廠商 國外知名廠商

    威盛、凌陽、瑞昱、聯發科, etc

    創意, 智原

    晶心

    智崴,系微, 滾雷

    環隆電氣, 新華

    威達電, 研華科技

    合勤,零壹, 友訊

    Agilent

    MSI, Gigabyte

    Acer, Asus

    宏達電,啟基

    裕隆, 中華

    Qualcomm, Boradcom

    Wipro

    ARM, MIPS

    Microsoft, Wind River, VenturCom, Quadros, QNX Software Sys,

    Palm

    Flextronics, Solectron

    Kontron

    3 com, CISCO

    Omron

    Sony, Apple

    HP, DELL

    Nokia, Motorola

    Toyota, Benz, BMW

  • SummarySystem software will be playing significant roles with IC industries in Taiwan.Dual-core and multi-core solutions are now the norm in embedded systems.Introduction to STAR Processor IPs.We present an enabled flow for performing PAC Compiler and Toolkits.Our compiler could generate efficient codes for a set of DSP loop kernels