CHAPTER 4
StarCore 3850 DSP
StarCore Architecture Roadmap
[Figure: StarCore architecture roadmap, stepping from V2 (SC1000 products) through V3 (SC140e products), V5 (SC3400 products) and V6 (SC3850 products) to future V7 products. Feature labels in the figure: up to 6-issue VLIW architecture, VLES, SIMD; additional SIMD instructions, memory protection, prediction; dual MAC, MMU support, additional ASI, enhanced video, dynamic branch prediction; enhanced control code support (V7). Product labels: MSC8101/3, MSC8122/26/12/13, wireless subscriber; MXC (2.5G, 3G, 3.5G); 8144/E/EC; 815x. Other labels: "1 GHz and beyond", "beyond (90nm)".]
StarCore Feature Evolution
Shared features:
• 6-issue, statically scheduled VLES model: 4 DALU + 2 AGU
• 128-bit instruction fetch, 2x64 data ports (3 memory accesses per cycle)
• Backward binary compatibility between all family members (16-bit basic inst. set)
Feature | SC140 (V2) | SC140e (V3) | SC3400 (V5) | SC3850 (V6)
Pipeline stages (max freq @SOI) | 5 (600 MHz @90G) | 5 (250 MHz @90LP) | 12 (1 GHz @90SOI) | 12 (1 GHz @45SOI)
Instructions | Baseline | Minor additions | Video, SIMD2 | Control ISA, Dual MPY, cache instructions
Precise exceptions | No | No | Yes | Yes
Privilege levels | No | Yes | Yes | Yes
Micro-arch. features | 1 VLES speculation | 1 VLES speculation | BTB, 4 VLES spec., 1 COF deep | BTB, 4 VLES spec., nested COF
Platform | M1 + L1 Icache | MMU, L1 I/D cache | L1, MMU, M2 | L1, MMU, L2/M2
SC3850 - Key Benefits
► Deep pipeline - high frequency
► Zero overhead HW loops
► Dynamic branch prediction (BTB) - Most COF take only 1 cycle
► Efficient Instructions
• Application Specific Instructions
• SIMD
• Better compiler support
► Six instructions per cycle
► Eight fully orthogonal true MACs / cycle
► Wide stack operations and fast access to stack
SC3850 - Key Benefits (Cont.)
► Precise exception support
• Allows system recovery from critical situations
• Allows a virtual memory system
► Small code size
► Ease of programming
• Protected pipeline – No need for manual hazard control
• Easy to create tight and efficient kernels
• Ability to write kernels directly, without intermediary means
► Backward binary compatible with SC140/SC3400
SC3850 Block Diagram
[Figure: SC3850 block diagram. The core connects to the Memory Sub-System (with WRQ), which attaches to the D/I interface bus, the DMA bus and the peripheral bus. Core-to-memory signals, as far as the labels are recoverable: XA and XB address buses (32 bits each), XA and XB data buses (64 bits each), XP address bus (32 bits) and XP data bus (128 bits). Core blocks: PCU (BTB, PAG, PDU, COF, INT), AGU (address registers, ALU0, ALU1, BMU), DALU (data registers and four ALUs, each with MACna, MACnb and Logicn for n = 0..3), OCE, RSU and the instruction bus.]
RSU – Resource Stall Unit
SC3850 Programming Model
[Figure: SC3850 programming model, register set by unit.]
AGU (32-bit registers):
• Address / Base Address Registers: R0-R7, R8/B0-R15/B7
• SP (NSP, ESP)
• Offset Registers: N0-N3
• Modifier Registers: M0-M3
• Modifier Control Register: MCTL
DALU (40-bit data registers D0-D15):
• Limit Tag Bit (LIMIT): L0-L15
• Extension (EXT): Dn.e, bits 39-32
• High Portion (HP): Dn.h, bits 31-16
• Low Portion (LP): Dn.l, bits 15-0
PCU (32-bit registers):
• Status Register (SR), Program Counter (PC), Exception and Mode Register (EMR)
• Implementation Dependent Configuration Register (IDCR), General Configuration Register
• SA0-SA3, LC0-LC3
• VBA Address Register
• Back Trace Registers: BTR0, BTR1
DALU Highlights
►MPY/MAC - Eight operations per cycle (16 bit precision each)
►FFT dedicated instructions
►Add/Subtract instructions
►Multi-bit shifts (arithmetic/logical) and logical operation instructions
►Arithmetic Instructions – ABS, NEG, RND, SAT, sign/zero extension
►Bit Field Instructions – EXTRACT, INSERT
►Multimedia dedicated instructions
►Data types of 8, 16, 20 and 40 bits
SC3850 DSP Core Data Processing Throughput
►DALU calculations are based on 40-bit registers
►The two multipliers of each ALU can be used in various ways:
• SIMD2 or dot-product multiplication
• Complex multiplication
• Extended precision multiplication (16x32, 32x32)
Operation | Precision | Operations per cycle
Real Multiply | 16x16 | 8
Real Multiply | 16x32 | 4
Real Multiply | 32x32 | 2
Complex Multiply | 16x16 | 2
Complex Multiply | 16x32 | 1
Kernel | Precision | SC3850 (cycles)
Real block FIR | 16x16 | NT/8
Complex FIR | 16x16 | NT/2
Dot Product | 16x16 | N/4
N: samples, T: taps
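The kernel figures above can be turned into quick cycle estimates. The following is a minimal sketch, assuming that "NT/8" means N multiplied by T and divided by 8, that partial cycles round up, and that loop setup and memory stalls are ignored. The function names are hypothetical and not part of any StarCore tool.

```python
from math import ceil

# Real multiplies per cycle by operand precision, taken from the table above.
REAL_MPY_PER_CYCLE = {"16x16": 8, "16x32": 4, "32x32": 2}


def real_multiply_cycles(count: int, precision: str = "16x16") -> int:
    """Cycles to issue `count` real multiplies (rounding up is an assumption)."""
    return ceil(count / REAL_MPY_PER_CYCLE[precision])


def real_fir_cycles(n_samples: int, n_taps: int) -> int:
    """Real block FIR, 16x16: N*T/8 cycles."""
    return ceil(n_samples * n_taps / 8)


def complex_fir_cycles(n_samples: int, n_taps: int) -> int:
    """Complex FIR, 16x16: N*T/2 cycles."""
    return ceil(n_samples * n_taps / 2)


def dot_product_cycles(n_samples: int) -> int:
    """Dot product, 16x16: N/4 cycles."""
    return ceil(n_samples / 4)
```

For example, a 16-tap real FIR over a 64-sample block would be estimated at 64*16/8 cycles under these assumptions.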
Dual Multiply ISA
Single MAC operation (SC140/SC3400)
mac Da,Db,Dn
Dn + (Da.H * Db.H) -> Dn
Da, Db: 16-bit high portions; Dn: 40-bit
Dual MAC – double throughput MAC (SC3850)
dmac Da,Db,Dn
Dn + (Da.H * Db.H) + (Da.L * Db.L) -> Dn
Da, Db: 16-bit high and low portions; Dn: 40-bit
Dual MAC – SIMD2 MAC (SC3850)
mac2 Da,Db,Dn
Dn.WH + (Da.H * Db.H) -> Dn.WH
Dn.WL + (Da.L * Db.L) -> Dn.WL
Da, Db: 16-bit high and low portions; Dn: two 20-bit portions (WH, WL)
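The three operations on this slide (mac, dmac, mac2) can be compared with a small behavioural model. This is only a sketch of the formulas as written on the slide: it assumes plain signed integer arithmetic, wraps results to the accumulator width instead of saturating, and ignores fractional scaling. The Python function names mirror the mnemonics for readability and are not a real API.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def halves(reg: int) -> tuple[int, int]:
    """Split a 32-bit operand into signed 16-bit (high, low) portions."""
    return to_signed(reg >> 16, 16), to_signed(reg, 16)


def mac(da: int, db: int, dn: int) -> int:
    """Dn + (Da.H * Db.H) -> Dn, 40-bit accumulator (wrap-around assumed)."""
    ah, _ = halves(da)
    bh, _ = halves(db)
    return to_signed(dn + ah * bh, 40)


def dmac(da: int, db: int, dn: int) -> int:
    """Dn + (Da.H * Db.H) + (Da.L * Db.L) -> Dn, 40-bit accumulator."""
    ah, al = halves(da)
    bh, bl = halves(db)
    return to_signed(dn + ah * bh + al * bl, 40)


def mac2(da: int, db: int, dn_wh: int, dn_wl: int) -> tuple[int, int]:
    """SIMD2 MAC: two independent 20-bit accumulators (Dn.WH, Dn.WL)."""
    ah, al = halves(da)
    bh, bl = halves(db)
    return to_signed(dn_wh + ah * bh, 20), to_signed(dn_wl + al * bl, 20)
```

In this model, dmac folds both products into one accumulator, while mac2 keeps them in separate lanes.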
AGU Highlights
►2 identical AGUs
►Move instructions
• Memory -> Data/Address regs (load)
• Data/Address regs -> Memory (store)
• MOVEx.yB - load/store Bytes
• MOVE.yW - Integer 16-bit Words (or packed bytes)
• MOVEx.yF – Fraction 16-bit Words (or packed bytes)
• MOVEx.yL – 32-bit long (or packed 16-bit or packed 8-bit)
• MOVEx.yBF - Fractional Bytes
►PUSH/POP to both stacks
►Basic integer arithmetic on address registers (including modulo addressing)
►AGU logic instructions: ANDA, ORA, EORA…
►BMU - Bit Mask Unit
• BIT test, clear, change, test&set, etc.
• Operand – data/address regs or memory
Modulo addressing
►In modulo addressing, the value is calculated implicitly
move.l d2,(r0)+
move.w (r0)+n0,d5
[Figure: circular buffer starting at base address Bn with length M; Rn points inside the buffer.]
If Rn - disp < Bn :
    Rn := Rn - disp + M
If Rn + disp ≥ Bn + M :
    Rn := Rn + disp - M
► Also valid for indexed moves, but without the register update
• ex: move.w (r0+n0),d0
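The wrap-around rule can be sketched in a few lines. This is an illustrative model rather than the hardware algorithm: it treats disp as a signed displacement (a negative value covers the "Rn - disp" case), assumes the buffer occupies [Bn, Bn + M), and assumes |disp| does not exceed M so that a single correction is enough. The function name is hypothetical.

```python
def modulo_update(rn: int, disp: int, bn: int, m: int) -> int:
    """Return the updated Rn after stepping by disp inside [bn, bn + m)."""
    new_rn = rn + disp
    if new_rn < bn:
        # Stepped below the buffer base: wrap to the top.
        new_rn += m
    elif new_rn >= bn + m:
        # Stepped past the end of the buffer: wrap to the base.
        new_rn -= m
    return new_rn
```

With an 8-byte buffer at 0x1000, stepping forward by 4 from 0x1006 wraps to 0x1002, and stepping back by 4 from 0x1001 wraps to 0x1005.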
VLES Architecture
►Instructions are grouped into Variable Length Execution Sets. A VLES may contain up to 2 AGU and 4 DALU instructions*.
► Example:
[ move.w (r0),d0     mac d0,d1,d2
  move.w (r1)+,d2    add d3,d4 ]
►VLES are packed in memory
• No need to pad with NOPs (contrary to classic VLIW)
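The grouping limit and the packing benefit can be illustrated with a short sketch. It assumes a hypothetical mnemonic-to-unit mapping covering only the instructions in the example (moves on the AGU, mac and add on the DALU), and it counts instructions rather than encoded bytes, so the comparison with a padded 6-slot VLIW is only indicative.

```python
# Hypothetical classification, covering only the mnemonics used in the example.
AGU_MNEMONICS = {"move.w", "move.l"}
DALU_MNEMONICS = {"mac", "add"}

MAX_AGU, MAX_DALU = 2, 4  # per-VLES limits stated on the slide


def unit_of(instruction: str) -> str:
    """Return 'AGU' or 'DALU' for an instruction string such as 'mac d0,d1,d2'."""
    mnemonic = instruction.split()[0]
    if mnemonic in AGU_MNEMONICS:
        return "AGU"
    if mnemonic in DALU_MNEMONICS:
        return "DALU"
    raise ValueError(f"unknown mnemonic: {mnemonic}")


def is_valid_vles(instructions: list[str]) -> bool:
    """Check the 2 AGU + 4 DALU limit (data hazards are not checked)."""
    agu = sum(unit_of(i) == "AGU" for i in instructions)
    dalu = len(instructions) - agu
    return agu <= MAX_AGU and dalu <= MAX_DALU


def packed_slots(program: list[list[str]]) -> int:
    """Instruction slots used when execution sets are packed back to back."""
    return sum(len(vles) for vles in program)


def padded_slots(program: list[list[str]]) -> int:
    """Slots a fixed 6-wide VLIW would use if every set were NOP-padded."""
    return (MAX_AGU + MAX_DALU) * len(program)
```

For a program made of the 4-instruction example set followed by a single add, the packed form uses 5 slots where NOP padding would use 12.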
Fetch and Execution Sets
[Figure: two timelines. The first shows execution sets in fetch order over time; the second shows execution sets in dispatch order over time, spread across the execution units DALU 0, DALU 1, DALU 2, DALU 3, AGU 0 and AGU 1.]
SC3850 DSP Core and Subsystem
►SC3850 SUBSYSTEM advantages
• Memory management unit (MMU)
  • Flexible memory protection - Easier debug, faster time to market
  • Address translation
  • Better MTBF (mean time between failures)
• L1 data and instr. caches – 2*32 KB, 8 way, hardware and software pre-fetch
• Private L2 cache – 512 KB, unified data/instr., dynamically defined as M2
• Debug and profiling - smart breakpoints, non-intrusive profiling capabilities
SC3850 DSP Core and Subsystem
►SC3850 CORE advantages
• 8 MACs/cycle, 1 GHz => 8 GMACS per core at 45nm
• High performance for deep pipeline architecture – advanced branch prediction
• Control code efficiency - hardware support for stack, many control-oriented instr.
• Easy programming - interlocked pipeline, backward compatible with all SC devices
• Intrinsic MAC functionality (vs. MPY + ADD) – 8 GMAC per core
• Multicore support - semaphore support (read-modify-write)
SC3850 DSP Sub-System Features – Caches
► Caches optimized to give best performance, reducing TTM
► L1 caches
• Instruction and data caches both: 32 KB, 8 way
• Data cache supports Write Back allocate and Write Through policies
• Advanced automatic pre-fetching:
  • Line pre-fetch with critical word first and next line pre-fetch
• SW-controlled pre-fetching with cache control instructions
►L2/M2 memory system
• 512 KB, configurable as L2 cache or M2 SRAM in 64 KB banks
• M2 SRAM accessible by DMA
• L2 cache: 8-ways, unified program and data
• Programmable cache way partitioning according to address ranges
• Low latency to the core (10-12 cycles)
• SW-triggered DMA-like pre-fetch channels operate in the background
• DMA based “Stashing” to DDR
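The 64 KB bank granularity of the L2/M2 memory implies a small set of possible splits. The sketch below enumerates them, assuming that any number of banks from 0 to 8 may be assigned to M2; the slides do not state whether every split is actually supported, and the function name is hypothetical.

```python
TOTAL_KB = 512                    # L2/M2 memory size from the slide
BANK_KB = 64                      # configuration granularity from the slide
NUM_BANKS = TOTAL_KB // BANK_KB   # 8 banks


def l2_m2_split(m2_banks: int) -> tuple[int, int]:
    """Return (l2_cache_kb, m2_sram_kb) when m2_banks banks are set up as M2."""
    if not 0 <= m2_banks <= NUM_BANKS:
        raise ValueError(f"m2_banks must be between 0 and {NUM_BANKS}")
    m2_kb = m2_banks * BANK_KB
    return TOTAL_KB - m2_kb, m2_kb


# All splits under the stated assumption, from all-cache to all-SRAM.
ALL_SPLITS = [l2_m2_split(n) for n in range(NUM_BANKS + 1)]
```

Assigning three banks to M2, for instance, leaves 320 KB of L2 cache next to 192 KB of DMA-accessible SRAM.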
SC3850 DSP Sub-System Features – Benefits of the MMU
►Memory protection, translation, and precise exceptions
►Simpler, abstract SW model - not SoC specific
►Good support for multi-core devices
• Code written once, unaware of the core it will actually run on
• Specific memory allocated per channel instance, on a specific core
►Easier debug, faster time to market
• MMU errors quickly catch a task that accesses memory out of bounds
• Virtual addressing allows simpler code re-use
►Better MTBF (Mean Time Between Failures)
• Channels are isolated from each other and from system code
• System code and privileged registers are protected in Supervisor level
• An errant task will not bring down the whole system
• Precise exceptions are serviced before the erroneous access executes, allowing recovery in some cases
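The protection behaviour described here can be mimicked with a toy model. It is a conceptual sketch only: the class and field names are hypothetical, regions are simple base/size ranges, and nothing here reflects the real SC3850 MMU programming interface. The point it shows is that the check happens before the store takes effect, so memory is unchanged when a fault is raised.

```python
from dataclasses import dataclass


class ProtectionFault(Exception):
    """Raised when a task touches memory outside its permitted regions."""


@dataclass(frozen=True)
class Region:
    base: int
    size: int
    writable: bool


class TaskMemory:
    """Memory view of one task: only its own regions are accessible."""

    def __init__(self, regions: list[Region]):
        self.regions = list(regions)
        self.data: dict[int, int] = {}

    def _check(self, addr: int, write: bool) -> None:
        for region in self.regions:
            in_range = region.base <= addr < region.base + region.size
            if in_range and (region.writable or not write):
                return
        raise ProtectionFault(f"illegal access at {addr:#x}")

    def store(self, addr: int, value: int) -> None:
        self._check(addr, write=True)   # fault is raised before memory changes
        self.data[addr] = value

    def load(self, addr: int) -> int:
        self._check(addr, write=False)
        return self.data.get(addr, 0)
```

A supervisor could catch `ProtectionFault`, stop only the offending task and keep the other channels running, which is the isolation benefit the slide describes.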
SC3850 DSP Sub-System Features – Debug and Profile
►Debug:
• Rich breakpoint capabilities
• Cache aware debug
• PC trace with task information
• Remote debug capability
►Profile:
• Performance optimization using detailed core stall information
• Measuring RTOS and system overhead
• Profiling at a function level
• Constraint violation monitors
[Figure: core stall breakdown, with categories including hold due to WRQ or WTB, hold due to Dcache, and hold due to WRQ retry mechanism.]
SC3850 DSP Sub-System Optimization Mechanisms
► L2 Cache software pre-fetch (SWPF), L1 DFETCH and PFETCH
[Figure: "SW pipeline" timeline. While Task1(code1,data1) executes, an L2 SWPF of code2 and/or data2 runs as a background fetch into the L2 cache, and PFETCH(code2) and/or DFETCH(data2) bring the same items into the L1 caches as inline fetches. Task2(code2,data2) then executes while code3 and/or data3 are fetched in the same way for Task3(code3,data3). Legend: inline fetch into L1 caches; background fetch into L2 caches. Note in the figure: in reality, fetches are smaller and more frequent.]
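The overlap shown in the figure can be written down as a simple schedule. This sketch only models the ordering, under the assumption that each task's code and data are pre-fetched for L2 and L1 while the previous task executes; the function and dictionary keys are hypothetical, and no real cache control instruction is invoked.

```python
def sw_pipeline(tasks: list[str]) -> list[dict]:
    """Pair each executing task with the task whose code/data is pre-fetched."""
    steps = []
    upcoming_tasks = tasks[1:] + [None]   # nothing left to fetch for the last task
    for current, upcoming in zip(tasks, upcoming_tasks):
        steps.append({
            "execute": current,     # task running in this slot
            "l2_swpf": upcoming,    # background fetch into L2 (SWPF)
            "l1_fetch": upcoming,   # inline PFETCH/DFETCH into L1
        })
    return steps
```

Calling `sw_pipeline(["Task1", "Task2", "Task3"])` yields three slots in which Task2 is fetched during Task1, Task3 during Task2, and nothing during Task3.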
Cache vs. DMA Model in SC3850 DSP Subsystem
[Figure: four software models compared, with effort as the labelled axis.]
DMA SW Model (100% M2)
• All in M2
• Highest performance
• High effort
• Generates higher bus load
• Expert mode – higher TTM
Mixed Model (L2 is partly M2)
• Critical code/data in M2
• Consider using L2 cache partitioning
• High performance
• Moderate-high effort
Scheduled Cache SW model (100% L2 + SWPF)
• All in DDR/M3
• Use SWPF
• Use L2 cache partitioning
• High performance
• Moderate effort
Cache SW model (100% L2)
• All in DDR/M3
• Good performance
• Low effort
Summary
►SC3850 cores and sub-systems are optimized using an advanced core architecture and a high-performance, flexible multi-level cache system
►Doubling the DSP computational capacity and significantly boosting compiled code performance
►Memory system which significantly improves initial startup efficiency and removes the need to DMA data in and out, as in traditional DSP devices, by adding smart pre-fetching mechanisms