View
234
Download
1
Embed Size (px)
Citation preview
Optimization On TI DSP System
Yu-Chang Huang(黃育彰 )
2006/01/05
Dep
t. Ele
ctron
ics E
ng
ineerin
g, N
atio
nal C
hia
o T
ung
Un
iversity
2
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Outline
• Reference
• TI DSP Platform
• Optimization On DSP Platform
• Conclusion
3
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Reference
• Texas Instrument -- http://www.ti.com/
• Texas Instrument Reference Guide, ” TMS320C6000 CPU and Instruction Set”, 2000
• http://www.sundance.com/
• 吳俊榮 , “Introdection to Sundance DSP-Development System”, Aug. 3 ,2005
• 王盈閔 , “MPEG-4 AAC Codec Acceleration and DSP implementation”,June,2004
• 旺陽電企業有限公司 -- http://www.vpdsp.com/
• 王逸如 , 陳信宏 , “ 數位訊號處理的新利器 TMS320 C6X” 修訂版 , 全華出版社 ,2004 年 3 月
4
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
DSP Based Systems
• Software-Oriented Embedded System– 80% of the development effort is software– Accelerate development time– System flexibility
• Real-Time Processing System– Real time implementation
• Event-Driven multi-tasking System– Facilitate Resource Allocation– Maximize System Utilization
5
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
TI DSP Overview
• Common Features of DSP Processor– Specialized Addressing Mode
• Circular addressing (Buffer), Bit-reversed addressing (FFT)
– Specialized Program Control• High Performance Interrupt handling• Pipeline Structure
– I/O Interface and On-Chip Peripherals• On-chip A/D, D/A converts• Timers• External Memory Interface (EMIF)• Direct Memory Access (DMA)• Multi-channel Buffered Serial Ports (McBSP)
6
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
TI DSP Platform
7
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
TI DSP Platform
• TMS 320 DSP Platforms– C2000 DSP Platforms (C24x , C28x)
• Control Optimized• Motor control 、 Digital control
– C5000 DSP Platforms (C54x , C55x)• Best Power Efficiency but slow• Handset 、 PDA 、 Digital camera
– C6000 DSP Platforms (C62x,C64x,C67x(floating))• High Performance• Image/Video processing, Wireless system
– OMAP• ARM(RISC CPU) + C55x• Smart phone 、 PDA
8
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
C6000 Family
Performance
9
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
C6416T DSP Platform• DSP Core
– VLIW DSP core ,Fully Software-Compatible with C64x– 64 32-Bit General-Purpose Registers, 8 Functional Units – 1-GHz Clock Rate, 32-bit fixed-point DSP, 8000MIPS
• Memory– 16K-Byte L1P(program cache) and L1D(data cache) – 1024K-Byte L2 cache (unified mapped)– 256MB of SDRAM @133MHz
• I/O– Two External Memory interfaces (EMIFs)– Enhanced Direct Memory Access (64 channels)– PLL clock Generator– Three 32-Bit General-Purpose Timers– IEEE-1149.1 (JTAG interface)
10
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Architecture of C6416T
11
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
C6416T DSP Platform
• Advantage– VLIW Architecture (compiler optimization)– Instruction Packing Reduces Code Size– Parallel processing with multiple function units (eight
32-Bit instruction / Cycle)
• Disadvantage– Non-Compatible with C2000 and C5000 family– Power Consumption
12
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Sundance – Carrier(SMT310Q) from Wu
JTAG Emulator
PCI Bridge
13
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Sundance – Module (SMT 395) from Wu
14
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Outline
• Reference
• TI DSP Platform
• Optimization On DSP Platform• Conclusion
15
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
DSP Program Development Flow• Phase 1
– Develop C code without any knowledge of the C6000
• Phase 2– Use the intrinsics, sh
ell options, and coding techniques to improve C code
• Phase 3– Rewrite the time-critic
al function in linear assembly
Write C code
Compile
Profile
Efficient Complete
Refine C code
Compile
Profile
Efficient
More C optimation
Write linear assembly
Assembly optimize
Profile
Efficient
Complete
Complete
Phase 1:Develop C Code
Phase 2:Refine C Code
Phase 3:Write Linear Assembly
Yes
Yes
Yes
Yes
No
No
No
16
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Language-based DSP Development Flow.c
.sa
.obj
.out
.asm
Link .cmd
Debug
BIOSLibrary
Graph
Profile
17
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Code Efficiency vs. Coding effort
Source Efficiency Coding Effort
Compiler Optimizer
Assembly Optimizer
Hand OptimizerAssembly
Linear Assembly
C sourceFile
50-80%
90-100%
100%
Low
Med
High
18
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Software Development Tool (CCS)
• Code Composer Studio (CCS)– Software pipeline is very important– Build option setting
• Optimization level : File level (-o3)• Program level optimization : Combine source to perform
program-level optimization (-pm-op2)
– Configurations• Release mode is faster than Debug mode (profile)
19
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Code Optimization
• Fixed Point Operation– Fixed point : char, short, int, long– Floating point : float, double
• Computation cycles with different data typesChar8-bit
Short16-bit
Int32-bit
Long40-bit
Float32-bit
Double64-bit
Add 1 1 1 2 77 146
Mul 2 2 6 8 54 69
20
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Code Optimization
• Use Intrinsic Function• Packet Data Processing• Change int type to char, short type
– Put 2 16-bit data or 4 8-bit data in a 32-bit space– Single instruction multiple data (SIMD) (intrinsic)
int
short
char
short
char char char
32 bitsA1 (short) A2 (short)
B1 (short) B2 (short)
+
=A1+B1 A2+B2
SIMD
21
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Code Optimization
• Loop unrolling– Break the branch barrier– Trade off between performance and code size– #pragma MUST_ITERATE(min, max, multiple), – #pragma UNROLL(n)
#pragma MUST_ITERATE(10)For (i=0;i<N;i++){ ……..}
22
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Code Optimization
• Using Macros function– Software-pipelined loop cannot contain function calls– Trade off between performance and code size
• Linear assembly– register usage (64 register)– parallel instructions ( || ) (option)– Determine functional unit (8 FUs) (option)
void h264_loop_filter_luma_c(uint8_t *pix, int xstride, int ystride, int alpha, int beta) {………}
.def _h264_loop_filter_luma_c_h264_loop_filter_luma_c: .cproc pix,xstride,ystride,alpha,beta .reg i,D,p0,p1,p2,q0,q1,q2,tc,tcb ……….. || ADD .D1 p0,p1,q1
.endproc
23
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Memory Optimization
• Memory Management is important– Designer’s Responsibility
• Memory Load/Store is critical– 80% time for load/store
• Cache Configuration Linker Command File (*.cmd)– Allocate memory
24
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Command FileMEMORY{ ISRAM: o = 0x00000000 l = 0x00040000 SDRAM: o = 0x80000000 l = 0x08000000}SECTIONS{ .text > ISRAM //Code .cinit > ISRAM //Initial values for global/static variables .stack > ISRAM //Stack (local variables) .const > ISRAM //Global and static string literals .switch > ISRAM //Tables for switch instructions .cio > ISRAM //Buffers for studio functions .bss > ISRAM //Global and static variables .far > ISRAM //Global and static declared far .sysmem > SDRAM //Memory for malloc functions (heap) .mycode > ISRAM .mydata > ISRAM} -stack 0x1F74-heap 0x500000
#pragma CODE_SECTION(function_name,”mycode”)
#pragma DATA_SECTION(array_name,”mydata”)
25
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Other Optimization Method
• Enhanced Direct Memory Access (EDMA)– DMA access memory v.s. CPU process data– Ping-Pong Buffer
• FPGA– Hardware acceleration
26
Institu
te of E
lectron
ics, Natio
nal C
hia
o T
ung
Un
iversity
Conclusion
• Introduction to TI DSP Platforms• Optimization Program on TI DSP Platforms