Lecture ASIP 5

7/28/2019 Lecture ASIP 5

1/59

VLSI Architecture :: MEL G642

MEL G642

Dr. A. Amalin Prince

BITS Pilani K.K. Birla Goa Campus

Department of Electrical, Electronics and Instrumentation Engineering


2/59

System on a chip

What is a DSP core?

What is a DSP processor?

What is a DSP subsystem?

MCU is the task controller: it executes tasks without real-time requirements.

[Figure: SoC design hierarchy. Labeled blocks: MCU subsystem (MCU core), DSP subsystem, DSP processor core (RF, control path, ALU, MAC, AGU), DM, PM, interrupt controller, timer, DMA, MMU, bus with arbitration, main memories, and customer IF.]


3/59

[Figure: DSP subsystem architecture. Labeled blocks: DSP core (RF, control path, ADG, ALU, MAC, accelerator), DSP processor (DM, DM, PM), DSP subsystem (DMA, MMU, interrupt controller, timer, other peripherals), bus and arbitration, main memories, and chip interface.]


4/59

Architecture and microarchitecture

The processor architecture is the hardware organization of the core and its peripherals, including the memory bus architecture. Architecture represents the relations between modules.

The microarchitecture design is the specification of the functional modules.

ASIP microarchitecture design is the implementation of an ISA specification into hardware modules.


5/59

Inside a core

The core can be divided into three parts: the datapath, the control path, and the address generation unit (AGU).

The core components are organized around two data buses:

The memory bus is distributed between the core and the memory subsystem.

The register bus connects the register file to all units in the core.


6/59

Memory subsystem in a DSP subsystem

The memory subsystem consists of data memories (DM), program (code) memory (PM), AGU, DMA, and MMU.


7/59

Peripherals in a DSP subsystem

Timers for counting clock cycles and events

Interrupt controller for handling interrupts

DMA (Direct Memory Access) controller for handling data transfers to/from main memory and between other memories/ports

MMU (Memory Management Unit) for reliable and efficient (address space) memory usage


8/59

DSP memory architecture



9/59

History of DSP memory architectures

[Figure: (a) Von Neumann architecture: one memory, a control unit, an arithmetic unit, and in-out. (b) Harvard architecture: a program memory and a data memory, a control unit, an arithmetic unit, and in-out.]


10/59

History of DSP memory architectures

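[Figure: three memory architectures (a), (b), and (c), each built from a datapath (DP) with data memory (DM) and a control path (CP) with program memory (PM); (b) and (c) add a MUX.]

(a) One tap of convolution requires multiple clock cycles.

(b) Fetch coefficients instead of instructions during CONV.

(c) Dual-port/multi-port memory required.

Used up to the 1980s.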


11/59


12/59

A typical DSP bus architecture

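[Figure: a typical DSP bus architecture. The datapath (register file, ALU, MAC, with operand buses OPA and OPB) sits on the register bus. The control path (CP) and addressing path (AGU) connect to PM, DM1, and DM2 through the PM bus (P-address, program) and the memory bus (D1-address, D1-data, D2-address, D2-data).]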


13/59

Control flow of DSP ASIP

[Figure: instruction flow FSM. Reset -> calculate PC -> request an instruction (send PC to PM) -> receive an instruction (get code from PM) -> decode the instruction and send control to DP (control signals to DP) -> generate operand addresses (generated address to storage units) -> receive states from DP (flags from DP) -> calculate PC.]


14/59

Data flow of DSP ASIP

[Figure: data flow FSM. Receive instruction (from PM) -> receive operand address (from address generator) -> fetch operands (send address to storage HW) -> execute instruction -> store result (send result to storage HW) -> return states (flags to PC FSM).]


15/59

A complete view of a DSP processor

[Figure: block diagram. Labeled blocks: PC FSM, PM, instruction decoder, RF, execution unit (ALU/MAC), operand and result control, configuration and status. Labeled signals: program address, instruction, program flow control, results, operation ctrl, MEM ctrl. Legend: data bus, control signals, memory bus, internal signals in control path.]


16/59

Modules in a core



17/59

Modules in a DSP core

Datapath: register file, ALU, MAC

AGU

Control path


18/59

Differences between design of DSP and MPU

The MPU designers think of ultimate performance and ultimate flexibility as well as the compiler-friendly instruction set.

The ASIP DSP designers think of application and cost first, and the challenge is to be efficient.

The goal of an ASIP design is to reach the highest performance over silicon, the highest performance over power consumption, and the highest performance over the design cost.


19/59

Is DSP CISC or RISC

A DSP, like a RISC:

More general-purpose registers.

Most instructions as simple instructions.

Instruction decoding by decoding logic circuit instead of microcode.

Regular instruction pipelining.

A DSP, like a CISC:

One execution cycle for ALU and multiple cycles for iteration.

Complicated data memory addressing modes and circuits.

Special-purpose registers (accumulator registers).

Strong instructions for accelerating certain tasks.


20/59

Is DSP CISC or RISC

DSP | RISC | CISC
Emphasis on hardware and software | Emphasis on software | Emphasis on hardware
Single and multiclock complex instructions | Single-clock, reduced instruction only | Includes multiclock complex instructions
Operands from registers; operands also from data memories | Operands only from registers; LOAD and STORE are used for register variables | Arithmetic computing based on memory-to-memory
Small code size | Large code size | Small code size
Most silicon area used for program and data storing | Most silicon area used for program and data storing | Silicon might be used for storing complex instructions (microcode)


21/59

Design instruction set



22/59

[Figure: instruction set design flow. Source code profiling (coverage and 10-90% locality) -> design of general RISC instructions -> design of CISC accelerated instructions -> design of miscellaneous instructions -> instruction set simulator and assembler -> benchmarking performance and coverage -> satisfied? (no / yes) -> instruction coding and release manual -> release the instruction set architecture.]


23/59

Release an instruction set

[Figure: loop between application profiling, design of assembly instruction set, and instruction set benchmarking.]

Release when the benchmarking result is equivalent to the requirements.


24/59

We need to identify problems

How is an instruction set designed and why is it designed in that way?

In which circumstances should a function be implemented using an instruction instead of a subroutine?

Why are ASIP DSP instructions not really RISC?

Why is my benchmarking not satisfactory?


25/59

What is the starting point

Let us start from the point of implementing C functions in an assembly instruction set.

A typical architecture with two DM in parallel.

Instructions including move-load-store, ALU/MAC, and program flow control.


26/59

Classify the Instruction set

Instruction group/type | Operands | Operations | Mathematical description | Flags | CC
Load, store, and move | Register name and memory addressing | Data transfer and addressing modes | DST <- (ADR) | |


27/59

Move-load-store instructions

RISC processor architecture is simple.

Data and parameters of a subroutine are loaded to the register file first.

Operands are from the register file or immediate data carried by an instruction.

Results in the register file need to be moved back to the data memory.


28/59

Move-load-store instructions

Mnem | Operand | Description | Operation | CC
Load | Rd, DA | Load data from memory 0/1 | Rd <- DM(DA) | 1
Store | DA, Rs | Store data to memory 0/1 | DM(DA) <- Rs | 1
move | Rd, Rs | Move between two registers | Rd <- Rs | 1
move | Rd, K | Move immediate data to a register | Rd <- immediate | 1
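The move-load-store group can be sketched as a tiny register-transfer model. This is only an illustration under assumptions that are not in the slides: the class name `Machine`, the method names, the memory size, and the 32-entry register file are all hypothetical. Each instruction is charged one cycle, as in the CC column.

```python
class Machine:
    """Hypothetical model: two data memories and one register file."""

    def __init__(self, dm_size=64, num_regs=32):
        self.dm = [[0] * dm_size, [0] * dm_size]  # DM0 and DM1
        self.r = [0] * num_regs                   # general register file
        self.cycles = 0                           # accumulated cycle cost

    def load(self, rd, mem, da):
        """Load Rd, DA: Rd <- DM(DA), from data memory 0 or 1."""
        self.r[rd] = self.dm[mem][da]
        self.cycles += 1

    def store(self, mem, da, rs):
        """Store DA, Rs: DM(DA) <- Rs, to data memory 0 or 1."""
        self.dm[mem][da] = self.r[rs]
        self.cycles += 1

    def move(self, rd, rs):
        """move Rd, Rs: Rd <- Rs."""
        self.r[rd] = self.r[rs]
        self.cycles += 1

    def move_imm(self, rd, k):
        """move Rd, K: Rd <- immediate (kept to 16 bits here)."""
        self.r[rd] = k & 0xFFFF
        self.cycles += 1


m = Machine()
m.move_imm(1, 1234)   # R1 <- 1234
m.store(0, 5, 1)      # DM0(5) <- R1
m.load(2, 0, 5)       # R2 <- DM0(5)
m.move(3, 2)          # R3 <- R2
```

The four instructions cost four cycles in this model, and the value travels register -> memory -> register -> register.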


29/59

Addressing for data memory access

Memory addressing is the addressing algorithm carried by an assembly instruction.

It specifies how to calculate the unique location of data in a data memory for a read or a write.

The addressing algorithm is implicit in C and explicit in ASM.


30/59

Addressing for data memory access

Name | DA | DA code cost (b) | Memory | Algorithm | CC
Direct | D | 16 | DM0/1 | 16-bit constant as the direct memory address | 1
Register indirect | R | 5 | DM0/1 | A register containing the memory address | 1
Register incremental | R++ | 5 | DM0/1 | R gives address, R = R + 1 after addressing | 1
Register decrement | --R | 5 | DM0/1 | R = R - 1 before addressing, R gives address | 1
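The addressing algorithms can be written out explicitly, which is what an assembly programmer sees and a C programmer does not. The sketch below is an assumption-laden illustration: the function name `data_address`, the mode names, and the post-increment behavior (address first, then R = R + 1) are hypothetical choices, not the processor's definition.

```python
def data_address(mode, regs, operand):
    """Return the data-memory address for one access; may update regs in place.

    mode    -- "direct", "indirect", "post_inc", or "pre_dec" (hypothetical names)
    operand -- the 16-bit constant for "direct", otherwise a register index
    """
    if mode == "direct":
        return operand & 0xFFFF                  # constant is the address
    if mode == "indirect":
        return regs[operand]                     # register holds the address
    if mode == "post_inc":
        addr = regs[operand]                     # use R, then R = R + 1
        regs[operand] = (regs[operand] + 1) & 0xFFFF
        return addr
    if mode == "pre_dec":
        regs[operand] = (regs[operand] - 1) & 0xFFFF   # R = R - 1 first
        return regs[operand]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = [0] * 32
regs[4] = 100
a0 = data_address("direct", regs, 0x0020)    # address 0x0020
a1 = data_address("indirect", regs, 4)       # address 100, R4 unchanged
a2 = data_address("post_inc", regs, 4)       # address 100, then R4 = 101
a3 = data_address("pre_dec", regs, 4)        # R4 = 100 first, address 100
```

A 5-bit register code addresses 32 registers, which is why the register file above has 32 entries.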


31/59

Arithmetic logic instructions

Basic arithmetic operations in C are +, -, *, /, and %.

The modulo operation % is not used very often for DSP arithmetic computing, so it is implemented using a subroutine.

The division operation / is not easy to implement in hardware.


32/59

Basic Arithmetic Instructions

Mnem | Operand | Description | Operation | Flags | CC
ADD | Rd, Rr | Add | Rd <- Ra + Rb | Z,N,V | 1
SUB | Rd, Rr | Subtract | Rd <- Ra - Rb | Z,N,V | 1
ABS | Rd, Rr | Absolute operation | Rd <- ABS(Ra) | Z,N,V | 1
INC | Rd | Increment | Rd <- Ra + 1 | Z,N,V | 1
DEC | Rd | Decrement | Rd <- Ra - 1 | Z,N,V | 1
MPL | A, Rd, Rr | Multiplication | A <- Ra * Rb | Z,N,V | 1
MAC | A, Rd, Rr | Multiplication and accumulation | A <- A + Ra * Rb | Z,N,V | 2
RND | Rd, A | Round, saturate, and truncate | Rd <- Saturate(Round(A)) | Z,N,V | 1
CAC | A | Clear an accumulator | A <- 0 | Z,N,V | 1
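The interplay of MAC and RND is easier to see with numbers. The following sketch assumes Q15 fractional data in 16-bit registers and a 40-bit accumulator with guard bits; the accumulator width, the helper names, and the round-half-up rule are assumptions for illustration only.

```python
ACC_BITS = 40  # assumed accumulator width: 32-bit product plus 8 guard bits


def wrap(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, ra, rb):
    """MAC: A <- A + Ra * Rb, with 16-bit signed operands."""
    return wrap(acc + ra * rb, ACC_BITS)


def rnd(acc):
    """RND: Rd <- Saturate(Round(A)), from a Q30 accumulator to a Q15 register."""
    rounded = (acc + (1 << 14)) >> 15          # add half an LSB, then truncate
    return max(-32768, min(32767, rounded))    # saturate to the 16-bit range


acc = 0                                        # CAC A
for x, h in [(16384, 16384)] * 3:              # 0.5 * 0.5, accumulated three times
    acc = mac(acc, x, h)
result = rnd(acc)                              # 0.75 in Q15

# Two products of nearly 1.0 * 1.0 exceed the Q15 range and saturate.
overflow = rnd(mac(mac(0, 32767, 32767), 32767, 32767))
```

The guard bits let the sum grow past 1.0 inside the accumulator; saturation is applied only once, when the long result is moved to a general register.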


33/59

Logic and Shift Operations

Logic and shift operations in C:

& (and), | (or), ~ (not), ^ (xor), << (left shift), >> (right shift).

Here "and" operates on each bit of operands A and B; that is, C[0] = A[0] & B[0], C[1] = A[1] & B[1], ..., C[15] = A[15] & B[15].


34/59

Logic and Shift Operations

Mnem | Operand | Description | Operation | Flags | CC
AND | Ra, Rb | A logic-and B | Rd <- Ra and Rb | C, Z | 1
OR | Ra, Rb | A logic-or B | Rd <- Ra or Rb | C, Z | 1
NOT | Ra, Rb | Invert A | Rd <- INV(Ra) | C, Z | 1
XOR | Ra, Rb | A logic-xor B | Rd <- Ra xor Rb | C, Z | 1
LS | Ra, Rb | Logic left shift | Rd <- Ra left shifted by Rb[3:0] | C, Z | 1
RS | Ra, Rb | Logic right shift | Rd <- Ra right shifted by Rb[3:0] | C, Z | 1
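A bit-level sketch of these six instructions on 16-bit words is given below. The function name `logic_op` is hypothetical, and only the Z flag is modeled; how the C flag is produced is not specified here, so it is left out rather than guessed.

```python
MASK16 = 0xFFFF


def logic_op(mnem, ra, rb=0):
    """Return (Rd, Z) for one 16-bit logic or shift instruction."""
    if mnem == "AND":
        rd = ra & rb
    elif mnem == "OR":
        rd = ra | rb
    elif mnem == "XOR":
        rd = ra ^ rb
    elif mnem == "NOT":
        rd = ~ra
    elif mnem == "LS":
        rd = ra << (rb & 0xF)              # shift count is Rb[3:0]
    elif mnem == "RS":
        rd = (ra & MASK16) >> (rb & 0xF)   # logic shift: zeros enter from the left
    else:
        raise ValueError(f"unknown mnemonic: {mnem}")
    rd &= MASK16                           # keep the 16-bit result
    return rd, int(rd == 0)


and_result = logic_op("AND", 0xF0F0, 0x0FF0)
xor_result = logic_op("XOR", 0x1234, 0x1234)
left = logic_op("LS", 0x8001, 1)
right = logic_op("RS", 0x8001, 0x11)       # only Rb[3:0] = 1 is used
```

Using only Rb[3:0] as the shift count limits the shift to 0..15 positions, which matches a 16-bit word.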



35/59

Logic Operators in C

Condition symbol Conditions

< Less than

<= Less than or equal to

== Equal to

>= Greater than or equal to

> Greater than

!= Not equal to

&& Boolean AND

|| Boolean OR

! Boolean NOT


36/59

Program flow control in C

Conditional and unconditional controls in C.

Unconditional: GOTO operations.

Conditional: condition test and jump in C are integrated, for example, if A then B else C.

In an assembly language:

Condition test and conditional jump are separated.

The first instruction performs the test and the flag computation.

The second instruction is the conditional jump.


37/59

Program flow control instructions

Mnem | Description | Condition | Flags to meet | CC
JLT | Jump when less than | < | N=1 | 3/1
JLE | Jump when less than or equal to | <= | N=1 or Z=1 | 3/1
JNE | Jump when not equal to | != | Z=0 | 3/1
JUMP | Unconditional jump | | | 3
CALL | Jump, push return address into stack | | | 3
Return | Return to the stacked address | | | 3
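The split between flag computation and conditional jump can be sketched in two small functions. The names `sub_flags` and `jump` are hypothetical, only the Z and N flags are modeled (overflow is ignored), and the 3/1 cycle cost is read as three cycles when the jump is taken and one when it is not.

```python
def sub_flags(a, b):
    """First instruction: a 16-bit SUB a - b, returning the Z and N flags."""
    diff = (a - b) & 0xFFFF
    return {"Z": int(diff == 0), "N": diff >> 15}


# Flag conditions for two of the conditional jumps.
CONDITIONS = {
    "JLT": lambda f: f["N"] == 1,   # less than
    "JNE": lambda f: f["Z"] == 0,   # not equal
}


def jump(mnem, flags):
    """Second instruction: return (taken, cycle cost) for a conditional jump."""
    taken = CONDITIONS[mnem](flags)
    return taken, 3 if taken else 1


lt = jump("JLT", sub_flags(3, 5))   # 3 < 5, so the jump is taken
ne = jump("JNE", sub_flags(7, 7))   # 7 == 7, so the jump falls through
```

In C the same decision is a single `if`; here the compare and the jump are two instructions with different costs.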



38/59

Target addressing for jumping

TA Algorithm

Absolute 16 bits constant

Relative In a general register


y


39/59

Assembly Instruction Set Summary


40/59

Benchmarking the instruction set


41/59

What is benchmark

DSP benchmarking gets the cycle cost and code size used by a DSP algorithm with single-precision data.

Convention of DSP benchmarking: rounding is required before moving long data from an accumulation register to a general register.


42/59

How to benchmark


43/59

How to benchmark

BDTI benchmarking convention

It measures the execution time (cycle cost), the code size (program memory cost), and the cost of data memories.

The cycle cost = prologue + kernel + epilogue

Prologue: preparing for running a program

Kernel: the part of the program that runs the algorithm itself

Epilogue: terminating the program


44/59

Assumption in this discussion

Data frame size: 40 samples.

The number of FIR taps = 16.

The cycle cost = 1 cycle per normal instruction and 3 cycles for a jump taken.

MAC takes one cycle if the following instruction does not use the data in an accumulator register.

TSMD: a typical single-MAC DSP processor available as COTS (commercial off-the-shelf).

Example: Block Transfer


45/59


C-code: DM1 (SEG: 0 to 39) -> DM1 (SEG: 0 to 39)

Assembly code




46/59


Processor | Algorithm | Total cycle cost | Pro-epilogue cycle cost | Kernel cycle cost | Total code cost | Code for pro-epilogue | DM cost
Basic (ours) | BT | 242 | 4 | 238 | 8 | 4 | 84
TSMD | BT | 47 | 4 | 43 | 7 | 4 | 84

The loop: the jump taken and the DEC of the loop counter cost four extra clock cycles in each iteration. A HW loop may eliminate this cost.

Load and store can be merged into a memory-to-memory move instruction.
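The block-transfer figures for the Basic processor can be reproduced with a small cycle model. The sketch assumes a software loop whose body is one Load and one Store per sample, followed by a DEC and a conditional jump; that loop shape and the function name `loop_cycle_cost` are assumptions, since the assembly listing itself is not reproduced here. The costs (1 cycle per normal instruction, 3 for a jump taken, 1 for a jump not taken) come from the stated assumptions.

```python
def loop_cycle_cost(iterations, body_cycles, jump_taken=3, jump_not_taken=1):
    """Kernel cycles of a software loop: body + DEC + conditional jump per pass."""
    per_pass = body_cycles + 1                          # body plus DEC of the counter
    taken = (iterations - 1) * (per_pass + jump_taken)  # all passes that jump back
    last = per_pass + jump_not_taken                    # the final pass falls through
    return taken + last


kernel = loop_cycle_cost(iterations=40, body_cycles=2)  # Load + Store per sample
total = 4 + kernel                                      # pro-epilogue + kernel

# Overhead share: DEC plus jump taken is 4 of the 6 cycles in each repeated pass.
overhead_per_pass = 1 + 3
```

Under these assumptions the model gives a kernel of 238 cycles and a total of 242 cycles for 40 samples, so two thirds of each repeated pass is loop overhead rather than data movement.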

Example: Single sample FIR


47/59


Modulo addressing

The FIFO is emulated in a data memory.

The memory addressing can be hardware accelerated (for accelerated instructions).



48/59

Example: consider 7 tap FIR Filter



49/59



50/59

Assembly code

Example: Single sample FIR: FIFO behavior


51/59

[Figure: the data memory space between the MIN address and the MAX address, with the FIFO buffer occupying BAR + 0 to BAR + 4. BAR and TAR mark the two ends of the buffer and DAR marks the current position. Four panels show the samples X(n) to X(n-4) in the buffer: Step 0 before getting new data, and Steps 1, 2, and 3 after getting new data 1, 2, and 3.]

Example: the procedure of a FIFO getting a new data sample
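The FIFO procedure can be emulated in software as a circular buffer. The sketch below assumes that BAR and TAR hold the bottom and top addresses of the buffer and that DAR points at the newest sample; the class name, the direction in which DAR moves, and the `read` helper are illustrative assumptions rather than the behavior of a specific processor.

```python
class CircularFifo:
    """Software-emulated FIFO inside a flat data memory (hypothetical model)."""

    def __init__(self, memory, bar, tar):
        self.mem = memory
        self.bar = bar      # bottom address of the buffer
        self.tar = tar      # top address of the buffer
        self.dar = bar      # address of the newest sample

    def push(self, sample):
        """Step DAR back with wrap-around and overwrite the oldest sample."""
        self.dar = self.tar if self.dar == self.bar else self.dar - 1
        self.mem[self.dar] = sample

    def read(self, k):
        """Return X(n-k), the sample that is k steps old."""
        length = self.tar - self.bar + 1
        return self.mem[self.bar + (self.dar - self.bar + k) % length]


dm = [0] * 32
fifo = CircularFifo(dm, bar=10, tar=14)     # five entries: BAR + 0 .. BAR + 4
for sample in [1, 2, 3, 4, 5, 6]:           # the sixth sample overwrites the first
    fifo.push(sample)
history = [fifo.read(k) for k in range(5)]  # X(n), X(n-1), ..., X(n-4)
```

No sample is ever copied when new data arrives; only DAR moves. The compare-and-wrap in `push` is the part that costs extra instructions in software and that hardware modulo addressing removes.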

Example: N sample FIR (single sample in loop)


52/59




53/59

Processor | Algorithm | Total cycle cost | Kernel cycle cost | Total code cost
Basic | 16-tap FIR | 192 | 173 | 26
TSMD | 16-tap FIR | 31 | 16 | 15

The cycle cost of the Basic processor is several times higher than the benchmark of a TSMD. Opportunities for improvement are:

The cost of SW-emulated circular buffer and modulo addressing is high, so HW circular buffer and modulo addressing is essential.

Data and coefficient loading, MAC, and the loop control can be merged into one instruction, convolution, which is one of the most frequently used instructions in DSP.

CONV N DM0(AP0++M) DM1(AP1++)
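What a single CONV instruction has to do can be emulated step by step. The sketch assumes that `++M` means post-increment with modulo wrap inside a circular buffer, that `++` means plain post-increment, and that DM0 holds the samples while DM1 holds the coefficients; the function signature and the explicit buffer parameters are hypothetical.

```python
def conv(n, dm0, ap0, buf_start, buf_len, dm1, ap1):
    """Emulate CONV N DM0(AP0++M) DM1(AP1++): N multiply-accumulate steps.

    AP0 post-increments modulo the buffer [buf_start, buf_start + buf_len).
    AP1 post-increments linearly. Returns (accumulator, AP0, AP1).
    """
    acc = 0
    for _ in range(n):
        acc += dm0[ap0] * dm1[ap1]                         # MAC
        ap0 = buf_start + (ap0 - buf_start + 1) % buf_len  # AP0++M
        ap1 += 1                                           # AP1++
    return acc, ap0, ap1


dm0 = [0, 0, 3, 1, 2, 0]   # circular sample buffer at addresses 2..4
dm1 = [10, 20, 30]         # coefficients
acc, ap0, ap1 = conv(3, dm0, ap0=3, buf_start=2, buf_len=3, dm1=dm1, ap1=0)
```

Starting at address 3, the samples are read in the order 1, 2, 3 with one wrap, and AP0 ends where it started. Each pass of the Python loop corresponds to work the hardware would do in parallel: two operand fetches, two address updates, one MAC, and the loop count.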

Example:


54/59

FIR filtering

Autocorrelation: used for finding regularities or periodic features of a signal.

Cross-correlation: used for measuring the similarity of a signal with a known signal pattern.

What is the difference?
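The two correlations share the same multiply-accumulate kernel; the difference is only in the second operand. The sketch uses the common unnormalized definition over the overlapping samples, which is an assumption since no formula is given here; the function names and the test signal are illustrative.

```python
def cross_correlation(x, y, lag):
    """Sum of x[n] * y[n + lag] over the samples where both sequences overlap."""
    return sum(x[n] * y[n + lag] for n in range(len(x)) if 0 <= n + lag < len(y))


def autocorrelation(x, lag):
    """Cross-correlation of a signal with itself."""
    return cross_correlation(x, x, lag)


signal = [1, -1, 1, -1, 1, -1, 1, -1]   # a signal with period 2
pattern = [1, -1]                       # a known signal pattern

auto = [autocorrelation(signal, k) for k in range(4)]
match = [cross_correlation(pattern, signal, k) for k in range(3)]
```

The autocorrelation peaks at even lags, which exposes the period of the signal, while the cross-correlation is largest at the lags where the known pattern lines up with the signal. Both are sums of products, so both map onto the same convolution-style instruction as FIR filtering.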


55/59

Analyses on identified problems

Lessons Learned


56/59

C does not give parallel features.

The convolution is one of the most used DSP operations. It reaches very high efficiency by having the memory addressing, arithmetic computing, result store, and program flow control carried out in parallel in one instruction.

It is possible because the parallel hardware can be organized in a pipeline.

Other frequently used iterative DSP operations can also be specified as one instruction.

Research work: Why?

Identify the requirement and benchmark it

Conclusion


57/59

An assembly language instruction set must be more efficient.

Accelerations are implemented at arithmetic and algorithmic levels.

Addressing and memory accesses should be executed in parallel with arithmetic computing.

Program flow control such as loop or conditional execution shall also be accelerated.

ASIP microarchitecture design flow


58/59

[Figure: ASIP microarchitecture design flow. Proposed assembly language manual -> further expose all micro operations of each assembly instruction -> partition micro operations into DP, CP, and AP -> schedule micro operations into each pipeline step (using the proposed pipeline steps) -> design for HW multiplexing in DP and AP -> specify microarchitecture and micro operations for CP -> release microarchitecture documents.]

The End :: Thank you for your attention


59/59

Questions?
