CHAPTER 4 StarCore3850 DSP - pub.roarusu/SistemeChipComunicatii/Cursuri/Ch... · 2015. 9. 16.

CHAPTER 4

StarCore 3850 DSP


StarCore Architecture Roadmap

[Figure: roadmap of StarCore architecture generations. Recoverable labels:]
• Generations: V2 (SC1000 products), V3 (SC140e products), V5 (SC3400 products), V6 (SC3850 products), V7 products
• Feature labels, oldest to newest: up to 6-issue VLIW architecture, VLES, SIMD; additional SIMD instructions, memory protection, prediction; dual MAC, MMU support, additional ASI, enhanced video, dynamic branch prediction; enhanced control-code support
• Device labels: MSC8101/3, MSC8122/26/12/13, wireless subscriber; 8144/E/EC, MXC (2.5G, 3G, 3.5G), 1 GHz and beyond (90nm); 815x

StarCore Feature Evolution

Shared features
• 6 issue, statically scheduled VLES model: 4 DALU + 2 AGU
• 128-bit instruction fetch, 2x64 data ports (3 memory accesses per cycle)
• Backward binary compatibility between all family members (16 bit basic inst. set)

Feature                          SC140 (V2)          SC140e (V3)         SC3400 (V5)                    SC3850 (V6)
Pipeline stages (max freq @SOI)  5 (600MHz @90G)     5 (250MHz @90LP)    12 (1GHz @90SOI)               12 (1GHz @45SOI)
Instructions                     Baseline            Minor additions     Video, SIMD2                   Control ISA, Dual MPY, Cache instructions
Precise exceptions               No                  No                  Yes                            Yes
Privilege levels                 No                  Yes                 Yes                            Yes
Micro-arch. features             1 VLES speculation  1 VLES speculation  BTB, 4 VLES spec., 1 COF deep  BTB, 4 VLES spec., nested COF
Platform                         M1 + L1 Icache      MMU, L1 I/D cache   L1, MMU, M2                    L1, MMU, L2/M2

SC3850 - Key Benefits

► Deep pipeline - high frequency
► Zero overhead HW loops
► Dynamic branch prediction (BTB) - Most COF take only 1 cycle
► Efficient Instructions
• Application Specific Instructions
• SIMD
• Better compiler support
► Six instructions per cycle
► Eight fully orthogonal true MACs / cycle
► Wide stack operations and fast access to stack

SC3850 - Key Benefits (Cont.)

► Precise exception support
• Allow system recovery from critical situations
• Allow a virtual memory system
► Small code size
► Ease of programming
• Protected pipeline – No need for manual hazard control
• Easy to create tight and efficient kernels
• Ability to write kernels directly, without intermediary means
► Backward binary compatible with SC140/SC3400

SC3850 Block Diagram

[Figure: SC3850 block diagram. Recoverable labels:]
• External buses: D/I interface bus, DMA bus, Peripheral bus
• Memory Sub-System, WRQ
• Core buses: address buses XA_A, XB_A and XP_A (32); data buses XA_D and XB_D (64); instruction bus XP_D (128)
• PCU: BTB, PAG, PDU, COF, REG
• AGU: Address Registers, ALU0, ALU1, BMU
• DALU: Data Registers, MAC0a/MAC0b/Logic0, MAC1a/MAC1b/Logic1, MAC2a/MAC2b/Logic2, MAC3a/MAC3b/Logic3
• INT, OCE, RSU – Resource Stall Unit
• Inst. Bus

SC3850 Programming Model

[Figure: SC3850 programming model. Recoverable register set:]

DALU
• Data Registers D0-D15, 40 bits each: Extension (EXT) Dn.e in bits 39-32, High Portion (HP) Dn.h in bits 31-16, Low Portion (LP) Dn.l in bits 15-0
• Limit Tag Bit (LIMIT) L0-L15, one per data register

AGU
• Address / Base Address Registers (bits 31-0): R0-R7 and R8/B0-R15/B7
• SP (NSP, ESP)
• Offset Registers N0-N3
• Modifier Registers M0-M3
• Modifier Control Register MCTL

PCU
• Status Register (SR), Program Counter (PC), Exception and Mode Register (EMR)
• VBA Address Register
• Implementation Dependent Configuration Register (IDCR), General Configuration Register
• SA0-SA3 and LC0-LC3
• Back Trace Registers BTR0, BTR1
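
The data-register layout above (8-bit extension, 16-bit high portion, 16-bit low portion) can be illustrated with a small sketch. This is only a bit-field model in Python, not SC3850 code; the function name `split_data_register` is made up for this example.

```python
def split_data_register(value: int) -> tuple[int, int, int]:
    """Split a 40-bit register image into (ext, high, low) fields.

    Field positions follow the programming-model figure:
    ext = bits 39-32, high = bits 31-16, low = bits 15-0.
    """
    value &= (1 << 40) - 1          # keep only the 40 register bits
    ext = (value >> 32) & 0xFF      # Dn.e
    high = (value >> 16) & 0xFFFF   # Dn.h
    low = value & 0xFFFF            # Dn.l
    return ext, high, low


if __name__ == "__main__":
    ext, high, low = split_data_register(0x12_3456_789A)
    print(f"ext={ext:#04x} high={high:#06x} low={low:#06x}")
```

The limit tag bit is kept separately from the 40 data bits in the figure, so it is not modelled here.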

DALU Highlights

►MPY/MAC - Eight operations per cycle (16 bit precision each)

►FFT dedicated instructions

►Add/Subtract instructions

►Multi bit Shifts (arithmetical/logical) and logical operation instructions

►Arithmetic Instructions – ABS, NEG, RND, SAT, sign/zero extension

►Bit Field Instructions – EXTRACT, INSERT

►Multimedia dedicated instructions

►Data types of 8,16,20 and 40 bits


SC3850 DSP Core Data Processing Throughput

►DALU calculations are based on 40-bit registers
►The two multipliers of each ALU can be used in various ways:
• SIMD2 or dot-product multiplication
• Complex multiplication
• Extended precision multiplication (16x32, 32x32)

Operation          Precision   Operations per cycle
Real Multiply      16x16       8
                   16x32       4
                   32x32       2
Complex Multiply   16x16       2
                   16x32       1

Kernel                  SC3850
Real block FIR 16x16    NT/8
Complex FIR 16x16       NT/2
Dot Product 16x16       N/4

N: samples, T: taps
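
To make the kernel table concrete, the sketch below evaluates its three expressions for a given N and T. It assumes the SC3850 column gives ideal cycle counts (the slide does not state the unit) and it ignores loop setup and memory stalls. The function names are invented for this illustration.

```python
def real_block_fir_cycles(n_samples: int, n_taps: int) -> float:
    """Real block FIR, 16x16: N*T/8 from the kernel table."""
    return n_samples * n_taps / 8


def complex_fir_cycles(n_samples: int, n_taps: int) -> float:
    """Complex FIR, 16x16: N*T/2 from the kernel table."""
    return n_samples * n_taps / 2


def dot_product_cycles(n_samples: int) -> float:
    """Dot product, 16x16: N/4 from the kernel table."""
    return n_samples / 4


if __name__ == "__main__":
    n, t = 256, 32  # example sizes, chosen arbitrarily
    print("real block FIR:", real_block_fir_cycles(n, t))
    print("complex FIR:   ", complex_fir_cycles(n, t))
    print("dot product:   ", dot_product_cycles(n))
```

For N = 256 and T = 32, the real block FIR comes out at 256 * 32 / 8 = 1024 under these assumptions.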

Dual Multiply ISA

Single MAC operation (SC140/SC3400)
mac Da,Db,Dn
Dn + (Da.H * Db.H) -> Dn
• Da, Db: 16-bit high portion used; Dn: 40-bit

Dual MAC – double throughput MAC (SC3850)
dmac Da,Db,Dn
Dn + (Da.H * Db.H) + (Da.L * Db.L) -> Dn
• Da, Db: 16-bit high and low portions; Dn: 40-bit

Dual MAC – SIMD2 MAC (SC3850)
mac2 Da,Db,Dn
Dn.WH + (Da.H * Db.H) -> Dn.WH
Dn.WL + (Da.L * Db.L) -> Dn.WL
• Da, Db: 16-bit high and low portions; Dn: two 20-bit portions (WH, WL)
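
The three MAC forms can be compared with a small arithmetic model. This is a sketch using plain Python integers. It does not model fractional arithmetic, saturation or the 20-bit and 40-bit width limits, and the tuple-based operand representation is an assumption made for readability.

```python
def mac(da_h: int, db_h: int, dn: int) -> int:
    """Single MAC: Dn + (Da.H * Db.H) -> Dn."""
    return dn + da_h * db_h


def dmac(da: tuple[int, int], db: tuple[int, int], dn: int) -> int:
    """Double-throughput MAC: both products accumulate into one Dn.

    da and db are (high, low) pairs.
    """
    (da_h, da_l), (db_h, db_l) = da, db
    return dn + da_h * db_h + da_l * db_l


def mac2(da: tuple[int, int], db: tuple[int, int],
         dn: tuple[int, int]) -> tuple[int, int]:
    """SIMD2 MAC: high and low products accumulate separately.

    dn is a (WH, WL) pair; the result is the updated (WH, WL) pair.
    """
    (da_h, da_l), (db_h, db_l) = da, db
    wh, wl = dn
    return wh + da_h * db_h, wl + da_l * db_l


if __name__ == "__main__":
    print(mac(2, 4, 10))
    print(dmac((2, 3), (4, 5), 10))
    print(mac2((2, 3), (4, 5), (10, 20)))
```

The difference between the two dual forms is only where the second product goes: dmac adds it to the same accumulator, mac2 keeps a second, independent accumulator.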

AGU Highlights

►2 identical AGUs
►Move instructions
• Memory -> Data/Address regs (load)
• Data/Address regs -> Memory (store)
• MOVEx.yB - load/store Bytes
• MOVE.yW - Integer 16-bit Words (or packed bytes)
• MOVEx.yF – Fraction 16-bit Words (or packed bytes)
• MOVEx.yL – 32-bit long (or packed 16-bit or packed 8-bit)
• MOVEx.yBF - Fractional Bytes
►PUSH/POP to both stacks
►Basic integer arithmetic on address registers (including modulo addressing)
►AGU logic instructions: ANDA, ORA, EORA…
►BMU - Bit Mask Unit
• BIT test, clear, change, test&set, etc.
• Operand – data/address regs or memory

Modulo addressing

►In modulo addressing, the wrapped address value is calculated implicitly
• ex: move.l d2,(r0)+
• ex: move.w (r0)+n0,d5
►The buffer starts at base address Bn and has length M; an update of Rn by disp wraps as follows:
• If Rn - disp < Bn : Rn := Rn - disp + M
• If Rn + disp ≥ Bn + M : Rn := Rn + disp - M
► Also valid for indexed moves, but without the register update
• ex: move.w (r0+n0),d0
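
The two wrap-around rules can be sketched as a single update function. This is an illustrative Python model, not the hardware algorithm: it folds both cases into one signed displacement and assumes the displacement is no larger than the buffer length M. The name `modulo_update` is hypothetical.

```python
def modulo_update(rn: int, disp: int, bn: int, m: int) -> int:
    """Return the updated address for a buffer [bn, bn + m).

    disp may be negative (decrement) or positive (increment).
    Assumes abs(disp) <= m, so one correction step is enough.
    """
    new = rn + disp
    if new < bn:            # fell below the base: wrap to the top
        new += m
    elif new >= bn + m:     # ran past the end: wrap to the bottom
        new -= m
    return new


if __name__ == "__main__":
    base, length = 0x1000, 8
    addr = 0x1006
    addr = modulo_update(addr, 4, base, length)   # wraps past the end
    print(hex(addr))
    addr = modulo_update(addr, -4, base, length)  # wraps below the base
    print(hex(addr))
```

For an indexed move such as the last example on the slide, the same wrapped value would be used as the access address while Rn itself stays unchanged.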

VLES Architecture

►Instructions are grouped into Variable Length Execution Sets. A VLES may contain up to 2 AGU and 4 DALU instructions*.
► Example:
[ move.w (r0),d0    move.w (r1)+,d2
  mac d0,d1,d2      add d3,d4 ]
►VLES are packed in memory
• No need to pad with NOPs (contrary to classic VLIW)
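
The packing benefit can be illustrated by counting instruction slots. The sketch below is a simplified model: each instruction is represented only by its unit type, every instruction counts as one slot (real encodings vary in size), and the "classic VLIW" comparison assumes a fixed 6-slot word padded with NOPs. All names are invented for this example.

```python
MAX_DALU = 4  # DALU instructions allowed per VLES
MAX_AGU = 2   # AGU instructions allowed per VLES


def is_valid_vles(units: list[str]) -> bool:
    """Check the per-set limits: at most 4 DALU and 2 AGU instructions."""
    if any(u not in ("DALU", "AGU") for u in units):
        return False
    return units.count("DALU") <= MAX_DALU and units.count("AGU") <= MAX_AGU


def slots_used(program: list[list[str]], fixed_width: int = 6) -> tuple[int, int]:
    """Return (packed_slots, padded_slots) for a list of execution sets."""
    packed = sum(len(vles) for vles in program)   # variable-length sets
    padded = fixed_width * len(program)           # fixed-width with NOPs
    return packed, padded


if __name__ == "__main__":
    program = [
        ["AGU", "AGU", "DALU", "DALU"],  # like the slide's example set
        ["DALU"],
        ["AGU", "DALU", "DALU", "DALU", "DALU"],
    ]
    assert all(is_valid_vles(v) for v in program)
    print(slots_used(program))
```

In this toy program the packed form needs 10 slots where the padded form needs 18, which is the effect the "no NOP padding" bullet describes.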

Fetch and Execution Sets

[Figure: two timelines. "Execution Sets in Fetch Order" shows the execution sets over time as fetched; "Execution Sets in Dispatch Order" shows them over time as dispatched to DALU 0, DALU 1, DALU 2, DALU 3, AGU 0 and AGU 1.]

SC3850 DSP Core and Subsystem

►SC3850 SUBSYSTEM advantages
• Memory management unit (MMU)
  • Flexible memory protection - Easier debug, faster time to market
  • Address translation
  • Better MTBF (mean time between failures)
• L1 data and instr. caches – 2*32 KB, 8 way, hardware and software pre-fetch
• Private L2 cache – 512 KB, unified data/instr., dynamically defined as M2
• Debug and profiling - smart breakpoints, non-intrusive profiling capabilities

SC3850 DSP Core and Subsystem

►SC3850 CORE advantages
• 8 MACs/cycle, 1GHz => 8 GMACS per core at 45nm
• High performance for deep pipeline architecture – advanced branch prediction
• Control code efficiency - hardware support for stack, many control-oriented instr.
• Easy programming - interlocked pipeline, backward compatible with all SC devices
• Intrinsic MAC functionality (vs. MPY + ADD) – 8 GMAC per core
• Multicore support - semaphore support (read-modify-write)

SC3850 DSP Sub-System Features – Caches

► Caches optimized to give best performance, reducing TTM
► L1 caches
• Instruction and data caches: 32KB each, 8 way
• Data cache supports Write Back allocate and Write Through policies
• Advanced automatic pre-fetching:
  • Line pre-fetch with critical word first and next line pre-fetch
• SW-controlled pre-fetching with cache control instructions
►L2/M2 memory system
• 512 KB, configurable as L2 cache or M2 SRAM in 64KB banks
• M2 SRAM accessible by DMA
• L2 cache: 8-way, unified program and data
• Programmable cache way partitioning according to address ranges
• Low latency to the core (10-12 cycles)
• SW-triggered DMA-like pre-fetch channels operate in the background
• DMA-based “Stashing” to DDR
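
The L2/M2 split can be worked through with simple arithmetic: 512 KB in 64 KB banks gives 8 banks. The sketch below assumes any number of banks from 0 to 8 may be assigned to M2, which the slide does not state explicitly. The function name and constants are hypothetical.

```python
TOTAL_KB = 512  # total L2/M2 memory, from the slide
BANK_KB = 64    # configuration granularity, from the slide


def l2_m2_split(m2_banks: int) -> tuple[int, int]:
    """Return (l2_cache_kb, m2_sram_kb) when m2_banks banks are M2 SRAM."""
    total_banks = TOTAL_KB // BANK_KB  # 8 banks
    if not 0 <= m2_banks <= total_banks:
        raise ValueError(f"m2_banks must be between 0 and {total_banks}")
    m2_kb = m2_banks * BANK_KB
    return TOTAL_KB - m2_kb, m2_kb


if __name__ == "__main__":
    for banks in range(TOTAL_KB // BANK_KB + 1):
        l2_kb, m2_kb = l2_m2_split(banks)
        print(f"{banks} M2 banks: L2 = {l2_kb} KB, M2 = {m2_kb} KB")
```

The two end points of this range correspond to the "100% L2" and "100% M2" software models, and the values in between to a mixed configuration.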

SC3850 DSP Sub-System Features – Benefits of the MMU

►Memory protection, translation, and precise exceptions
►Simpler, abstract SW model - not SoC specific
►Good support for multi-core devices
• Code written once, unaware of the core it will actually run on
• Specific memory allocated per channel instance, on a specific core
►Easier debug, faster time to market
• MMU errors quickly catch a task that accesses out of bounds
• Virtual addressing allows simpler code re-use
►Better MTBF (Mean Time Between Failures)
• Channels are isolated from each other and from system code
• System code and privileged registers protected in Supervisor level
• An errant task will not bring down the whole system
• Precise exceptions serviced before the error executes, allowing recovery in some cases

SC3850 DSP Sub-System Features – Debug and Profile

►Debug:
• Rich breakpoint capabilities
• Cache aware debug
• PC trace with task information
• Remote debug capability
►Profile:
• Performance optimization using detailed core stall information
• Measuring RTOS and system overhead
• Profiling at a function level
• Constraint violation monitors

[Figure: profiling view of core stall causes, with labels "Hold due to WRQ or WTB", "Hold due to Dcache" and "Hold due to WRQ Retry Mechanism".]

SC3850 DSP Sub-System Optimization Mechanisms

► L2 Cache software pre-fetch (SWPF), L1 DFETCH and PFETCH

[Figure: "SW Pipeline" timeline. While Task1(code1,data1) executes, an L2 SWPF of code2 and/or data2 runs as a background fetch into the L2 cache; PFETCH(code2) and/or DFETCH(data2) then fetch inline into the L1 caches before Task2(code2,data2) executes. The same pattern repeats with code3 and data3 for Task3(code3,data3). Legend: inline fetch into L1 caches; background fetch into L2 caches. Note on the figure: "In reality: Smaller and more frequent".]
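
The ordering in the "SW Pipeline" figure can be sketched as a simple schedule generator. This is a toy Python model of the ordering only: it does not issue any real pre-fetch instruction, and the step names and dictionary layout are assumptions made for illustration.

```python
def sw_pipeline(tasks: list[str]) -> list[dict]:
    """Build an ordered list of steps for a sequence of tasks.

    While task i executes, the L2 software pre-fetch (SWPF) for task
    i + 1 runs in the background; an inline L1 fetch (PFETCH/DFETCH)
    for task i + 1 follows before that task starts.
    """
    steps = []
    for i, task in enumerate(tasks):
        nxt = tasks[i + 1] if i + 1 < len(tasks) else None
        steps.append({"execute": task, "l2_swpf_background": nxt})
        if nxt is not None:
            steps.append({"l1_inline_fetch": nxt})
    return steps


if __name__ == "__main__":
    for step in sw_pipeline(["Task1", "Task2", "Task3"]):
        print(step)
```

As the figure notes, real pre-fetches would be smaller and more frequent than one block per task.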

Cache vs. DMA Model in SC3850 DSP Subsystem

DMA SW Model (100% M2)
• All in M2
• Highest performance
• High effort
• Generate higher bus load
• Expert Mode – Higher TTM

Mixed Model (L2 is partly M2)
• Critical code/data in M2
• Consider using L2 Cache partitioning
• High performance
• Moderate-High effort

Scheduled Cache SW model (100% L2 + SWPF)
• All in DDR/M3
• Use SWPF
• Use L2 Cache partitioning
• High performance
• Moderate effort

Cache SW model (100% L2)
• All in DDR/M3
• Good performance
• Low effort

Summary

►SC3850 cores and sub-systems are optimized using an advanced core architecture and a high-performance, flexible multi-level cache system
►Doubling the DSP computational capacity and significantly boosting compiled code performance
►Memory system which significantly improves initial startup efficiency and removes the need to DMA data in and out, as in traditional DSP devices, by adding smart pre-fetching mechanisms
