Optimization On TI DSP System Yu-Chang Huang( 黃育彰 ) 2006/01/05 Dept. Electronics Engineering, National Chiao Tung University

Optimization On TI DSP System

Yu-Chang Huang(黃育彰 )

2006/01/05

Dept. of Electronics Engineering, National Chiao Tung University

Institute of Electronics, National Chiao Tung University

Outline

• Reference

• TI DSP Platform

• Optimization On DSP Platform

• Conclusion


Reference

• Texas Instruments -- http://www.ti.com/

• Texas Instruments Reference Guide, "TMS320C6000 CPU and Instruction Set", 2000

• http://www.sundance.com/

• 吳俊榮 (Wu), "Introduction to Sundance DSP-Development System", Aug. 3, 2005

• 王盈閔, "MPEG-4 AAC Codec Acceleration and DSP Implementation", June 2004

• 旺陽電企業有限公司 -- http://www.vpdsp.com/

• 王逸如, 陳信宏, "數位訊號處理的新利器 TMS320 C6X" (A New Tool for Digital Signal Processing: TMS320 C6X), revised edition, 全華出版社, March 2004


DSP Based Systems

• Software-Oriented Embedded System
  – 80% of the development effort is software
  – Accelerates development
  – System flexibility
• Real-Time Processing System
  – Real-time implementation
• Event-Driven Multi-tasking System
  – Facilitates resource allocation
  – Maximizes system utilization


TI DSP Overview

• Common Features of DSP Processors
  – Specialized addressing modes
    • Circular addressing (buffers), bit-reversed addressing (FFT)
  – Specialized program control
    • High-performance interrupt handling
    • Pipeline structure
  – I/O interfaces and on-chip peripherals
    • On-chip A/D and D/A converters
    • Timers
    • External Memory Interface (EMIF)
    • Direct Memory Access (DMA)
    • Multi-channel Buffered Serial Ports (McBSP)
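The two addressing modes listed above can be illustrated in plain C. This is only a software sketch of what the DSP hardware does during address generation; the names `BUF_LEN`, `circ_next`, and `bit_reverse` are hypothetical and do not come from any TI library.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_LEN 8u  /* assumed power of two, so wrapping is a single AND */

/* Circular addressing: advance an index by one and wrap at BUF_LEN. */
size_t circ_next(size_t idx)
{
    return (idx + 1u) & (BUF_LEN - 1u);
}

/* Bit-reversed addressing: reverse the lowest `bits` bits of x.
 * An FFT of length 2^bits reorders its data with this permutation. */
uint32_t bit_reverse(uint32_t x, unsigned bits)
{
    uint32_t r = 0;
    unsigned i;
    for (i = 0; i < bits; i++) {
        r = (r << 1) | (x & 1u);  /* move the lowest bit of x into r */
        x >>= 1;
    }
    return r;
}
```

On a DSP with hardware support, both operations are folded into the load/store address update, so a loop over a delay line or an FFT stage spends no extra instructions on them.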


TI DSP Platform


• TMS320 DSP Platforms
  – C2000 DSP Platforms (C24x, C28x)
    • Control optimized
    • Motor control, digital control
  – C5000 DSP Platforms (C54x, C55x)
    • Best power efficiency, but lower performance
    • Handset, PDA, digital camera
  – C6000 DSP Platforms (C62x, C64x, C67x (floating point))
    • High performance
    • Image/video processing, wireless systems
  – OMAP
    • ARM (RISC CPU) + C55x
    • Smart phone, PDA


C6000 Family

Performance


C6416T DSP Platform

• DSP Core
  – VLIW DSP core, fully software-compatible with C64x
  – 64 32-bit general-purpose registers, 8 functional units
  – 1-GHz clock rate, 32-bit fixed-point DSP, 8000 MIPS
• Memory
  – 16K-byte L1P (program cache) and 16K-byte L1D (data cache)
  – 1024K-byte L2 (unified mapped RAM/cache)
  – 256 MB of SDRAM @ 133 MHz
• I/O
  – Two External Memory Interfaces (EMIFs)
  – Enhanced Direct Memory Access (64 channels)
  – PLL clock generator
  – Three 32-bit general-purpose timers
  – IEEE-1149.1 (JTAG interface)


Architecture of C6416T


C6416T DSP Platform

• Advantages
  – VLIW architecture (compiler optimization)
  – Instruction packing reduces code size
  – Parallel processing with multiple functional units (eight 32-bit instructions per cycle)
• Disadvantages
  – Not code-compatible with the C2000 and C5000 families
  – Higher power consumption


Sundance – Carrier (SMT310Q), from Wu

[Figure: SMT310Q carrier board, with the JTAG emulator and the PCI bridge labeled]


Sundance – Module (SMT395), from Wu


Outline

• Reference

• TI DSP Platform

• Optimization On DSP Platform

• Conclusion


DSP Program Development Flow

• Phase 1: Develop C code
  – Develop C code without any knowledge of the C6000
• Phase 2: Refine C code
  – Use the intrinsics, shell options, and coding techniques to improve the C code
• Phase 3: Write linear assembly
  – Rewrite the time-critical functions in linear assembly

[Figure: development flowchart]
• Phase 1: Write C code -> Compile -> Profile -> Efficient? Yes: complete. No: go to Phase 2.
• Phase 2: Refine C code -> Compile -> Profile -> Efficient? Yes: complete. No: more C optimization? Yes: refine the C code again. No: go to Phase 3.
• Phase 3: Write linear assembly -> Assembly optimize -> Profile -> Efficient? Yes: complete.


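Language-based DSP Development Flow

[Figure: build flow. Source files (.c, .sa, .asm) are translated to object files (.obj), linked with the linker command file (.cmd) and the BIOS library into an executable (.out), which is then used for debugging, graphing, and profiling.]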


Code Efficiency vs. Coding effort

• C source file (compiler optimizer): efficiency 50-80%, coding effort low
• Linear assembly (assembly optimizer): efficiency 90-100%, coding effort medium
• Assembly (hand optimized): efficiency 100%, coding effort high


Software Development Tool (CCS)

• Code Composer Studio (CCS)
  – Software pipelining is very important
  – Build option settings
    • Optimization level: file level (-o3)
    • Program-level optimization: combine source files to perform program-level optimization (-pm -op2)
  – Configurations
    • Release mode is faster than Debug mode (profile)


Code Optimization

• Fixed-Point Operation
  – Fixed point: char, short, int, long
  – Floating point: float, double
• Computation cycles with different data types

Operation | char (8-bit) | short (16-bit) | int (32-bit) | long (40-bit) | float (32-bit) | double (64-bit)
Add | 1 | 1 | 1 | 2 | 77 | 146
Mul | 2 | 2 | 6 | 8 | 54 | 69
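The cycle counts above are the reason fractional values are usually kept in fixed-point form on this core. Below is a minimal sketch of Q15 arithmetic (a 16-bit value representing a fraction in [-1, 1)); the helper names `q15_from_double` and `q15_mul` are hypothetical, and the code assumes the compiler shifts negative values arithmetically, which is the common behavior.

```c
#include <stdint.h>

/* Convert a double in [-1, 1) to Q15, rounding to nearest and saturating. */
int16_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)(scaled >= 0 ? scaled + 0.5 : scaled - 0.5);
}

/* Q15 x Q15 -> Q15 multiply with rounding and saturation. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product, fits in 32 bits */
    p = (p + (1 << 14)) >> 15;            /* round, then scale back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;     /* only (-1) * (-1) can overflow */
    return (int16_t)p;
}
```

For example, 0.5 is stored as 16384, and `q15_mul(16384, 16384)` yields 8192, which is 0.25 in Q15. The whole operation uses one 16-bit multiply, an add, and a shift instead of a software floating-point routine.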


Code Optimization

• Use intrinsic functions
• Packed data processing
• Change int type to char or short type
  – Put two 16-bit values or four 8-bit values in one 32-bit word
  – Single instruction, multiple data (SIMD) through intrinsics

[Figure: a 32-bit word holds one int, two shorts, or four chars. SIMD example: (A1, A2) + (B1, B2) = (A1+B1, A2+B2), where A1, A2, B1, and B2 are shorts and both sums are produced by one instruction.]
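The packed addition in the figure can be sketched in portable C. The TI C6000 compiler offers an `_add2` intrinsic for this kind of operation; the code below does not call it and is only a host-side emulation of the idea, with hypothetical helper names (`pack2`, `add2_emulated`, `upper16`, `lower16`).

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (hi goes in the upper half). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Emulated packed 16-bit add: the two halves are added independently.
 * No carry crosses from the lower half into the upper half, and results
 * wrap modulo 2^16 (no saturation). */
uint32_t add2_emulated(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}

/* Extract the signed halves again. */
int16_t upper16(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t lower16(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
```

With A = (1, 2) and B = (3, 4), `add2_emulated(pack2(1, 2), pack2(3, 4))` equals `pack2(4, 6)`. On the DSP the same result comes from a single instruction, which is where the gain over two separate short additions arises.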


Code Optimization

• Loop unrolling
  – Break the branch barrier
  – Trade-off between performance and code size
  – #pragma MUST_ITERATE(min, max, multiple)
  – #pragma UNROLL(n)

#pragma MUST_ITERATE(10)
for (i = 0; i < N; i++) {
    ……..
}
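As a fuller illustration of the two pragmas, here is a hedged sketch of a dot product that tells the compiler the trip count is at least 8 and a multiple of 4, and asks for 4x unrolling. The function name `dot_product` and the trip-count assumptions are made up for this example. The pragmas are guarded by `__TI_COMPILER_VERSION__`, assumed here to be defined by the TI compiler, so the file also builds cleanly on a host compiler.

```c
/* Dot product of two 16-bit vectors.
 * Assumption for this sketch: n >= 8 and n is a multiple of 4. */
int dot_product(const short *a, const short *b, int n)
{
    int i;
    int sum = 0;

#ifdef __TI_COMPILER_VERSION__
    /* Hints for the TI compiler: minimum trip count 8, no stated maximum,
     * trip count always a multiple of 4; request unrolling by 4. */
    #pragma MUST_ITERATE(8, , 4)
    #pragma UNROLL(4)
#endif
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the compiler knows the loop never runs fewer than 8 times and always in multiples of 4, it can unroll without emitting a remainder loop, which is the code-size side of the trade-off mentioned above.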


Code Optimization

• Using macros instead of functions
  – A software-pipelined loop cannot contain function calls
  – Trade-off between performance and code size
• Linear assembly
  – Register usage (64 registers)
  – Parallel instructions ( || ) (optional)
  – Functional unit selection (8 FUs) (optional)

void h264_loop_filter_luma_c(uint8_t *pix, int xstride, int ystride, int alpha, int beta) {………}

        .def _h264_loop_filter_luma_c
_h264_loop_filter_luma_c: .cproc pix,xstride,ystride,alpha,beta
        .reg i,D,p0,p1,p2,q0,q1,q2,tc,tcb
        ………..
 ||     ADD .D1 p0,p1,q1
        .endproc
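The macro bullet above can be illustrated with a small clipping loop. This is a sketch with hypothetical names (`CLIP`, `clip_to_u8`); because the macro expands inline, the loop body contains no call, which keeps the loop eligible for software pipelining.

```c
/* Clip x to the range [lo, hi]. The arguments are evaluated more than
 * once, so they must not have side effects. */
#define CLIP(x, lo, hi) ((x) < (lo) ? (lo) : ((x) > (hi) ? (hi) : (x)))

/* Clip every sample to the 8-bit pixel range [0, 255]. */
void clip_to_u8(const int *in, unsigned char *out, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = (unsigned char)CLIP(in[i], 0, 255);
    }
}
```

The cost is that the macro body is duplicated at every use site, so heavy use grows the code size.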


Memory Optimization

• Memory management is important
  – Designer's responsibility
• Memory load/store is critical
  – 80% of the time is spent on load/store
• Cache configuration
• Linker command file (*.cmd)
  – Allocate memory


Command File

MEMORY
{
    ISRAM: o = 0x00000000 l = 0x00040000
    SDRAM: o = 0x80000000 l = 0x08000000
}
SECTIONS
{
    .text   > ISRAM  /* Code */
    .cinit  > ISRAM  /* Initial values for global/static variables */
    .stack  > ISRAM  /* Stack (local variables) */
    .const  > ISRAM  /* Global and static string literals */
    .switch > ISRAM  /* Tables for switch instructions */
    .cio    > ISRAM  /* Buffers for stdio functions */
    .bss    > ISRAM  /* Global and static variables */
    .far    > ISRAM  /* Global and static variables declared far */
    .sysmem > SDRAM  /* Memory for malloc functions (heap) */
    .mycode > ISRAM
    .mydata > ISRAM
}
-stack 0x1F74
-heap 0x500000

The section name given in each pragma must match the name used in SECTIONS:

#pragma CODE_SECTION(function_name, ".mycode")

#pragma DATA_SECTION(array_name, ".mydata")
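To show how the two pragmas are used in a source file, here is a small sketch that places a coefficient table in `.mydata` and the function that reads it in `.mycode`. The names `filter_coeffs` and `apply_filter` are hypothetical. The pragmas are guarded by `__TI_COMPILER_VERSION__`, assumed to be defined by the TI compiler, so the same file also builds on a host compiler, where the placement is simply skipped.

```c
/* Place the coefficient table in the .mydata section (internal RAM in
 * the command file above). The pragma must precede the definition. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(filter_coeffs, ".mydata")
#endif
short filter_coeffs[4] = { 1, 2, 3, 4 };

/* Place the time-critical function in the .mycode section. */
#ifdef __TI_COMPILER_VERSION__
#pragma CODE_SECTION(apply_filter, ".mycode")
#endif
int apply_filter(const short *x)
{
    int i;
    int acc = 0;
    for (i = 0; i < 4; i++) {
        acc += filter_coeffs[i] * x[i];  /* 4-tap multiply-accumulate */
    }
    return acc;
}
```

Keeping both the table and the function in internal memory avoids SDRAM accesses inside the loop, which matters given how much time goes to load/store.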


Other Optimization Method

• Enhanced Direct Memory Access (EDMA)
  – The DMA moves data in memory while the CPU processes data
  – Ping-pong buffer
• FPGA
  – Hardware acceleration
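The ping-pong buffer idea can be sketched in portable C. No real EDMA API is used here: `fake_dma_fetch` is a hypothetical stand-in built on `memcpy`, whereas on hardware the transfer would be started asynchronously and would run while the CPU works on the other buffer. `BLOCK` and `process_stream` are likewise illustrative names.

```c
#include <string.h>

#define BLOCK 4  /* samples per buffer, chosen arbitrarily for the sketch */

/* Stand-in for an EDMA transfer from external to internal memory. */
static void fake_dma_fetch(int *dst, const int *src, int n)
{
    memcpy(dst, src, (size_t)n * sizeof *dst);
}

/* Sum nblocks blocks of BLOCK ints from src using two alternating buffers. */
long process_stream(const int *src, int nblocks)
{
    int buf[2][BLOCK];
    int active = 0;   /* buffer the CPU is currently processing */
    long total = 0;
    int blk, i;

    if (nblocks <= 0) return 0;
    fake_dma_fetch(buf[active], src, BLOCK);   /* prime the "ping" buffer */

    for (blk = 0; blk < nblocks; blk++) {
        int next = active ^ 1;
        if (blk + 1 < nblocks) {
            /* Fetch the next block into the idle "pong" buffer. */
            fake_dma_fetch(buf[next], src + (blk + 1) * BLOCK, BLOCK);
        }
        for (i = 0; i < BLOCK; i++) {
            total += buf[active][i];           /* CPU work on this block */
        }
        active = next;                         /* swap buffer roles */
    }
    return total;
}
```

With a real asynchronous transfer, the CPU would also wait for the fetch to complete before swapping, so that it never reads a buffer that is still being filled.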


Conclusion

• Introduction to TI DSP platforms

• Program optimization on TI DSP platforms
