
Computer Architecture
Lecture 8: Vector Processing (Chapter 4)

Chih-Wei Liu 劉志尉
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw

Introduction

• SIMD architectures can exploit significant data-level parallelism for:
  – matrix-oriented scientific computing
  – media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
  – Only needs to fetch one instruction per data operation
  – Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially

SIMD Variations

• Vector architectures
• SIMD extensions
  – MMX: multimedia extensions (1996)
  – SSE: streaming SIMD extensions
  – AVX: advanced vector extensions
• Graphics Processing Units (GPUs)
  – Considered as SIMD accelerators

SIMD vs. MIMD

• For x86 processors:
  – Expect two additional cores per chip per year
  – SIMD width to double every four years
  – Potential speedup from SIMD to be twice that from MIMD

Vector Architectures

• Basic idea
  – Read sets of data elements into "vector registers"
  – Operate on those registers
  – Disperse the results back into memory
• Registers are controlled by the compiler
  – Register files act as compiler-controlled buffers
  – Used to hide memory latency
  – Leverage memory bandwidth
• Vector loads/stores are deeply pipelined
  – Pay for memory latency once per vector load/store
• Regular loads/stores
  – Pay for memory latency for each vector element

Vector Supercomputers

• In the 70s-80s: supercomputer = vector machine
• Definitions of supercomputer:
  – Fastest machine in the world at a given task
  – A device to turn a compute-bound problem into an I/O-bound problem
  – The CDC 6600 (Cray, 1964) is regarded as the first supercomputer
• Vector supercomputers (epitomized by the Cray-1, 1976)
  – Scalar unit + vector extensions:
    • Vector registers, vector instructions
    • Vector loads/stores
    • Highly pipelined functional units

Cray-1 (1976)

[Block diagram of the Cray-1: single-port memory with 16 banks of 64-bit words + 8-bit SECDED, sustaining 80 MW/sec for data load/store and 320 MW/sec for instruction-buffer refill; 4 instruction buffers (64-bit x 16); 8 address (A0-A7) and 8 scalar (S0-S7) registers backed by 64 B and 64 T registers; eight 64-element vector registers (V0-V7) with vector-length and vector-mask registers; pipelined functional units for FP add/multiply/reciprocal, integer add/logic/shift/pop-count, and address add/multiply. Memory bank cycle: 50 ns; processor cycle: 12.5 ns (80 MHz).]

Vector Programming Model

[Figure: the programming model provides scalar registers r0-r15 and vector registers v0-v15, each vector register holding elements [0] .. [VLRMAX-1], plus a vector length register (VLR). Vector arithmetic, e.g. ADDV v3, v1, v2, computes v3[i] = v1[i] + v2[i] for i = 0 .. VLR-1. Vector load/store, e.g. LV v1, r1, r2, moves a vector between memory (base address r1, stride r2) and vector register v1.]

Example: VMIPS

• Loosely based on the Cray-1
• Vector registers
  – Each register holds a 64-element, 64 bits/element vector
  – The register file has 16 read ports and 8 write ports
• Vector functional units
  – Fully pipelined
  – Data and control hazards are detected
• Vector load-store unit
  – Fully pipelined
  – Words move between registers and memory
  – One word per clock cycle after the initial latency
• Scalar registers
  – 32 general-purpose registers
  – 32 floating-point registers

VMIPS Instructions

• Operate on many elements concurrently
  – Allows use of slow but wide execution units
  – High performance at lower power
• Elements within a vector instruction are independent
  – Allows scaling of functional units without costly dependence checks
• Flexible vector lengths
  – 64 64-bit, 128 32-bit, 256 16-bit, or 512 8-bit elements
  – Matches the needs of multimedia (8-bit) and of scientific applications that require high precision

Vector Instructions

• ADDVV.D: add two vectors
• ADDVS.D: add a vector to a scalar
• LV/SV: vector load and vector store from an address
• Vector code example: see the DAXPY example on a later slide

Vector Memory-Memory vs. Vector Register Machines

• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, the CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• The Cray-1 ('76) was the first vector register machine

Example source code:
  for (i = 0; i < N; i++) {
      C[i] = A[i] + B[i];
      D[i] = A[i] - B[i];
  }

Vector memory-memory code:
  ADDV C, A, B
  SUBV D, A, B

Vector register code:
  LV   V1, A
  LV   V2, B
  ADDV V3, V1, V2
  SV   V3, C
  SUBV V4, V1, V2
  SV   V4, D

Vector Memory-Memory vs. Vector Register Machines

• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
  – All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  – Must check dependencies on memory addresses
• VMMAs incur greater startup latency
  – Scalar code was faster on the CDC Star-100 for vectors of fewer than 100 elements
  – For the Cray-1, the vector/scalar breakeven point was around 2 elements
• Apart from the CDC follow-ons (Cyber-205, ETA-10), all major vector machines since the Cray-1 have had vector register architectures
  (we ignore vector memory-memory from now on)

Vector Instruction Set Advantages

• Compact
  – One short instruction encodes N operations
• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as the previous instruction
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – Can run the same object code on more parallel pipelines, or lanes

Vector Instructions Example

• Example: DAXPY — adds a scalar multiple of a double-precision vector to another double-precision vector

  L.D     F0, a        ; load scalar a
  LV      V1, Rx       ; load vector X
  MULVS.D V2, V1, F0   ; vector-scalar multiply
  LV      V3, Ry       ; load vector Y
  ADDVV.D V4, V2, V3   ; add
  SV      Ry, V4       ; store result

• Requires only 6 instructions
• In MIPS code: the ADD waits for the MUL, and the S.D waits for the ADD
• In VMIPS: stall once for the first vector element; subsequent elements flow smoothly down the pipeline
  – A pipeline stall is required only once per vector instruction, not once per vector element
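For reference, this is the scalar loop the vector sequence implements, as a minimal C sketch (64 elements, matching the VMIPS vector length):

  /* DAXPY: Y = a*X + Y, double precision */
  void daxpy(double a, double *X, double *Y) {
      for (int i = 0; i < 64; i++)
          Y[i] = a * X[i] + Y[i];
  }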

Challenges of Vector Instructions

• Start-up time
  – The application and architecture must support long vectors; otherwise, they will run out of instructions, requiring ILP instead
  – Determined by the latency of the vector functional unit
  – Assume the same latencies as the Cray-1:
    • Floating-point add      => 6 clock cycles
    • Floating-point multiply => 7 clock cycles
    • Floating-point divide   => 20 clock cycles
    • Vector load             => 12 clock cycles

Vector Arithmetic Execution

• Use a deep pipeline (=> fast clock) to execute element operations
• Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards)

[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, with one element pair entering per cycle.]

Vector Instruction Execution

ADDV C, A, B

[Figure: execution using one pipelined functional unit — A[3]+B[3] .. A[6]+B[6] are in flight as C[0], C[1], C[2] complete — versus four-lane execution using four pipelined functional units, where A[24..27]+B[24..27] enter together as C[0..11] complete; lane k handles elements k, k+4, k+8, ...]

Vector Unit Structure

[Figure: a four-lane vector unit. Each lane contains one pipeline of each functional unit and a slice of the vector register file: lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, .... All lanes connect to the memory subsystem.]

Multiple Lanes Architecture

• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
  – Allows for multiple hardware lanes
  – No communication between lanes
  – Little increase in control overhead
  – No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
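A minimal C sketch of the lane interleaving described above (VL and the arrays A, B, C are assumed declared; illustrative only — in hardware the per-lane loops run concurrently, one per lane):

  #define NLANES 4
  /* element i lives in register-file slice i % NLANES; each lane's own
     functional-unit pipeline computes its subset of C = A + B */
  for (int lane = 0; lane < NLANES; lane++)
      for (int i = lane; i < VL; i += NLANES)
          C[i] = A[i] + B[i];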

Code Vectorization

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

[Figure: scalar sequential code issues load, load, add, store for iteration 1, then again for iteration 2, and so on; vectorized code issues one vector load, a second vector load, one vector add, and one vector store that cover all iterations and overlap in time.]

Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis.

Vector Execution Time

• Execution time depends on three factors:
  – Length of the operand vectors
  – Structural hazards
  – Data dependencies
• VMIPS functional units consume one element per clock cycle
  – Execution time is approximately the vector length
• Convoy
  – A set of vector instructions that could potentially execute together

Vector Instruction Parallelism

• Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes

[Figure: timing of the load unit, multiply unit, and add unit. Each vector instruction occupies its unit for 32/8 = 4 cycles; successive load, mul, and add instructions overlap across the three units, so the machine completes 24 operations/cycle while issuing 1 short instruction/cycle.]

Convoy

• Convoy: a set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining

Vector Chaining

• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available

  LV   V1          ; load from memory into V1
  MULV V3, V1, V2
  ADDV V5, V3, V4

[Figure: results stream from the load unit into V1 and chain directly into the multiplier (V1 * V2 -> V3), whose results in turn chain into the adder (V3 + V4 -> V5), without waiting for each instruction to finish.]

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction

[Figure: Load completes entirely, then Mul runs, then Add.]

• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: Mul starts as soon as Load begins producing results; Add overlaps Mul in the same way.]

Chimes

• Chime: the unit of time to execute one convoy
  – m convoys execute in m chimes
  – For a vector length of n, this requires approximately m x n clock cycles

Example

  LV      V1, Rx      ; load vector X
  MULVS.D V2, V1, F0  ; vector-scalar multiply
  LV      V3, Ry      ; load vector Y
  ADDVV.D V4, V2, V3  ; add two vectors
  SV      Ry, V4      ; store the sum

Convoys:
  1. LV, MULVS.D
  2. LV, ADDVV.D
  3. SV

3 chimes, 2 FP ops per result, so cycles per FLOP = 1.5
For 64-element vectors, this requires 64 x 3 = 192 clock cycles

Vector Strip-mining

Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers ("strip-mining")

for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

      ANDI   R1, N, 63     ; N mod 64
      MTC1   VLR, R1       ; do the remainder piece first
loop: LV     V1, RA
      DSLL   R2, R1, 3     ; multiply by 8
      DADDU  RA, RA, R2    ; bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      ; subtract elements done
      LI     R1, 64
      MTC1   VLR, R1       ; reset to full length
      BGTZ   N, loop       ; any more to do?

[Figure: arrays A, B, C are processed as an initial remainder piece followed by full 64-element pieces.]

Vector Length Register

• Handles loops whose length is not equal to 64
• The vector length may not be known at compile time; use the Vector Length Register (VLR) and strip-mining for vectors over the maximum length:

  low = 0;
  VL = n % MVL;                           /* find the odd-size piece using a modulo op */
  for (j = 0; j <= n / MVL; j = j + 1) {  /* outer loop */
      for (i = low; i < low + VL; i = i + 1)  /* runs for length VL */
          Y[i] = a * X[i] + Y[i];         /* main operation */
      low = low + VL;                     /* start of the next vector */
      VL = MVL;                           /* reset the length to the maximum vector length */
  }

Maximum Vector Length (MVL)

• Determines the maximum number of elements in a vector for a given architecture
• All strip-mined blocks but the first are of length MVL
  – Utilizes the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA

Vector-Mask Control

Simple implementation
  – execute all N operations; turn off result writeback according to the mask

[Figure: every element operation A[i] op B[i] proceeds, but a write-disable driven by mask bits M[0..7] (here M[1] = M[4] = M[5] = M[7] = 1, the rest 0) suppresses writeback to the masked-off elements.]

Density-time implementation
  – scan the mask vector and only execute the elements with non-zero masks

[Figure: only the element operations whose mask bit is 1 are issued to the write data port; masked-off elements are skipped entirely.]
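A minimal C sketch contrasting the two implementations (VL, the operand arrays, and the mask M are assumed declared):

  /* simple implementation: compute every element, write back under mask */
  for (int i = 0; i < VL; i++) {
      double t = A[i] - B[i];
      if (M[i]) C[i] = t;          /* writeback disabled when M[i] == 0 */
  }

  /* density-time implementation: skip masked-off elements entirely */
  for (int i = 0; i < VL; i++)
      if (M[i]) C[i] = A[i] - B[i];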

Vector Mask Register (VMR)

• A Boolean vector that controls the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to software
• Both GPUs and vector processors spend time on masking

Vector Mask Registers: Example

• Programs that contain IF statements in loops cannot otherwise be run on a vector processor
• Consider:
  for (i = 0; i < 64; i = i + 1)
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:

  LV      V1, Rx      ; load vector X into V1
  LV      V2, Ry      ; load vector Y
  L.D     F0, #0      ; load FP zero into F0
  SNEVS.D V1, F0      ; sets VM(i) to 1 if V1(i) != F0
  SUBVV.D V1, V1, V2  ; subtract under vector mask
  SV      Rx, V1      ; store the result in X

• The GFLOPS rate decreases (masked-off elements still take time)

Compress/Expand Operations

• Compress packs the non-masked elements from one vector register contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation

[Figure: with mask bits M[1] = M[4] = M[5] = M[7] = 1 and the rest 0, compress packs A[1], A[4], A[5], A[7] into the front of the destination register; expand scatters a packed vector back into positions 1, 4, 5, 7, leaving the remaining positions (B[0], B[2], B[3], B[6]) unchanged.]

• Used for density-time conditionals and also for general selection operations
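A minimal C sketch of both operations (VL, the mask M, and the arrays A, D, P are assumptions for illustration):

  /* compress: pack the elements of A where M[i] != 0 into the front of D */
  int j = 0;
  for (int i = 0; i < VL; i++)
      if (M[i]) D[j++] = A[i];     /* j ends up equal to popcount(M) */

  /* expand: the inverse - scatter the packed vector P back to the masked positions */
  j = 0;
  for (int i = 0; i < VL; i++)
      if (M[i]) D[i] = P[j++];     /* unmasked positions keep their old values */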

Memory Banks

• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. What about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Start-up penalties on load/store units are higher than those on arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, since each processor will generate its own independent stream of addresses

Example (Cray T90)

• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
  1. The maximum number of memory references each cycle is 32 x 6 = 192
  2. An SRAM access takes 15 / 2.167 ≈ 7 processor cycles
  3. It requires 192 x 7 = 1344 memory banks to sustain full memory bandwidth

The Cray T932 actually has only 1024 memory banks (it cannot sustain full bandwidth).
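The same arithmetic as a small C sketch:

  #include <math.h>

  int banks_needed(void) {
      int refs_per_cycle = 32 * (4 + 2);    /* 192 memory references per cycle */
      int busy = (int)ceil(15.0 / 2.167);   /* SRAM busy time: ~7 processor cycles */
      return refs_per_cycle * busy;         /* 192 * 7 = 1344 banks */
  }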

Memory Addressing

• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize

Interleaved Memory Layout

[Figure: a vector processor connected to 8 unpipelined DRAM banks; bank k holds the words whose address mod 8 = k.]

• Great for unit stride:
  – Contiguous elements sit in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The layout above is good for strides that are relatively prime to 8
  – Bad for strides of 2 and 4

Handling Multidimensional Arrays in Vector Architectures

• Consider:
  for (i = 0; i < 100; i = i + 1)
      for (j = 0; j < 100; j = j + 1) {
          A[i][j] = 0.0;
          for (k = 0; k < 100; k = k + 1)
              A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }
• Must vectorize the multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help with non-unit-stride data
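As a sketch, this is what a strided vector load (LVWS) gathers for one column j of the row-major 100x100 double matrix D (names illustrative):

  double vd[100];                  /* destination vector register */
  for (int k = 0; k < 100; k++)
      vd[k] = D[k][j];             /* consecutive elements are 100 doublewords
                                      (800 bytes) apart in memory */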

How to Get Full Bandwidth for Unit Stride

• The memory system must sustain (# lanes x word) per clock
• # memory banks > memory latency (in cycles), to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (see the literature on how to implement a prime number of banks efficiently)

Memory Bank Conflicts

  int x[256][512];
  for (j = 0; j < 512; j = j + 1)
      for (i = 0; i < 256; i = i + 1)
          x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column-order accesses conflict on every word
• SW fix: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW fix: a prime number of banks
  – bank number = address mod (number of banks)
  – address within bank = address / (number of words in a bank)
  – but that costs a modulo and a divide per memory access with a prime number of banks
  – instead: address within bank = address mod (number of words in a bank)
  – this is easy if a bank holds 2^N words

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts
• Stall one of the requests to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit again before its bank busy time has elapsed; with one access per cycle, this happens when:
  LCM(stride, # banks) / stride < bank busy time
  (equivalently, # banks / GCD(stride, # banks) < bank busy time)

Example

• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and one with a stride of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) collides with the previous access and has to wait out the 6-cycle bank busy time
     – the total time is 12 + 1 + 6 x 63 = 391 cycles, i.e., 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures

• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
  for (i = 0; i < n; i = i + 1)
      A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gather/scatter)

Scatter-Gather

• Consider:
  for (i = 0; i < n; i = i + 1)
      A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

  LV      Vk, Rk       ; load K
  LVI     Va, (Ra+Vk)  ; load A[K[]]
  LV      Vm, Rm       ; load M
  LVI     Vc, (Rc+Vm)  ; load C[M[]]
  ADDVV.D Va, Va, Vc   ; add them
  SVI     (Ra+Vk), Va  ; store A[K[]]

• A and C must have the same number of non-zero elements (the sizes of K and M)

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than out-of-order execution
• More lanes allow a slower clock rate
  – Scalable if elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units, and registers; length of vector registers; exception handling; conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures

• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [table omitted]

Multimedia Extensions

• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (AltiVec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep the multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia

• Intel MMX: 57 additional 80x86 instructions (the first extension since the 386)
  – similar to the Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements in 64 bits
  – reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claim: overall speedup of 1.5 to 2x for 2D/3D graphics, audio, video, speech, communications
  – used in drivers or added to library routines; no compiler support

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (sets to max) if the number is too large

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (AVX) (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM

• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPUs, 4 32b FPUs
  – Integrates the FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than a vector architecture
• Separate, aligned data transfers
  – Vector: a single instruction may make 64 memory accesses, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

Example SIMD Code

• Example: DXPY

        L.D    F0, a        ; load scalar a
        MOV    F1, F0       ; copy a into F1 for SIMD MUL
        MOV    F2, F0       ; copy a into F2 for SIMD MUL
        MOV    F3, F0       ; copy a into F3 for SIMD MUL
        DADDIU R4, Rx, #512 ; last address to load
  Loop: L.4D   F4, 0(Rx)    ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D F4, F4, F0   ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D   F8, 0(Ry)    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D F8, F8, F4   ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D   0(Ry), F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU Rx, Rx, #32  ; increment index to X
        DADDIU Ry, Ry, #32  ; increment index to Y
        DSUBU  R20, R4, Rx  ; compute bound
        BNEZ   R20, Loop    ; check if done

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only accesses contiguous data efficiently
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write an entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model

• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

Examples

• Attainable GFLOPs/sec = min(Peak Memory Bandwidth x Arithmetic Intensity, Peak Floating-Point Performance)

[Roofline plots omitted: the sloped part of each roof is the memory-bandwidth bound; the flat part is the peak floating-point bound.]
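A minimal C sketch of the bound (the peak numbers are assumptions that would come from the target machine's datasheet):

  #include <math.h>

  /* attainable GFLOPs/sec for a kernel with the given arithmetic intensity */
  double roofline(double peak_gflops, double peak_bw_gb_per_s, double flops_per_byte) {
      return fmin(peak_bw_gb_per_s * flops_per_byte, peak_gflops);
  }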

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • The CPU is the host, the GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – The programming model is "Single Instruction, Multiple Thread" (SIMT)

Programming Model

• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – provide a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, no scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU

• CUDA programming model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus, grid size = 16 blocks
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
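The grid-size arithmetic from the example, as a C sketch:

  int n = 8192;                      /* elements to process */
  int threads_per_block = 512;
  int nblocks = (n + threads_per_block - 1) / threads_per_block;  /* = 16 blocks */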

Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example

• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU

[Figure: memory hierarchy of the GTX570 — 1280 MB global memory and a 640 KB L2 cache shared by 15 streaming multiprocessors (SM 0 .. SM 14); each SM has 32 cores, 32,768 registers, 48 KB shared memory, a 16 KB L1 cache, an 8 KB texture cache, an 8 KB constant cache, and supports up to 1536 threads per SM.]

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication

[Figure omitted.]

Programming the GPU

[Figure omitted.]

Matrix Multiplication

• For a 4096x4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared in them are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example

  // Invoke DAXPY
  daxpy(n, 2.0, x, y);

  // DAXPY in C
  void daxpy(int n, double a, double *x, double *y)
  {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }

  // Invoke DAXPY with 256 threads per thread block
  __host__
  int nblocks = (n + 255) / 256;
  daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

  // DAXPY in CUDA (kernels launched with <<<>>> must be __global__)
  __global__
  void daxpy(int n, double a, double *x, double *y)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a * x[i] + y[i];
  }

NVIDIA Instruction Set Architecture

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):

  shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512, or 2^9)
  add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
  ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
  ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
  mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – A branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

  if (X[i] != 0)
      X[i] = X[i] - Y[i];
  else
      X[i] = Z[i];

  ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
  setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push      ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
  ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
  sub.f64       RD0, RD0, RD2 ; Difference in RD0
  st.global.f64 [X+R8], RD0   ; X[i] = RD0
  @P1, bra ENDIF1, *Comp      ; complement mask bits
                              ; if P1 true, go to ENDIF1
  ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
          st.global.f64 [X+R8], RD0  ; X[i] = RD0
  ENDIF1: <next instruction>, *Pop   ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU memory
  – The host can read and write GPU memory

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Processor

[Figure: block diagram of a Fermi streaming multiprocessor.]

Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires that there be no loop-carried dependence
• Example 1:
  for (i = 999; i >= 0; i = i - 1)
      x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism

• Example 2:
  for (i = 0; i < 100; i = i + 1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 use values computed by S1 in the previous iteration
  – Loop-carried dependence
• S2 uses a value computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)

• Example 3:
  for (i = 0; i < 100; i = i + 1) {
      A[i] = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)

• Transform to:
  A[0] = A[0] + B[0];
  for (i = 0; i < 99; i = i + 1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

  for (i = 0; i < 100; i = i + 1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
  }

• The second reference to A need not be translated to a load instruction
  – The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize
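A sketch of the optimized form the compiler can produce, keeping the value in a register instead of reloading A[i] (the temporary t is illustrative):

  for (int i = 0; i < 100; i = i + 1) {
      double t = B[i] + C[i];  /* computed once, held in a register */
      A[i] = t;                /* store to A[i] */
      D[i] = t * E[i];         /* reuse t; no reload of A[i] */
  }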

Recurrence

• A recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i = 1; i < 100; i = i + 1)
      Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Used to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a*i + b (where i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependences: Example

• Assume:
  – We store to an array element with index a*i + b and load from the same array with index c*i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are iteration indices j and k such that m <= j <= n and m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) divides |d - b|

Example

  for (i = 0; i < 100; i = i + 1)
      X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2; |b - d| = 3
  3. Since 2 does not divide 3, no dependence is possible
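A minimal C sketch of the GCD test (function names are illustrative):

  int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

  /* store index a*i + b, load index c*i + d: a loop-carried dependence
     is possible only if GCD(c, a) divides |d - b| */
  int dependence_possible(int a, int b, int c, int d) {
      int diff = d - b;
      if (diff < 0) diff = -diff;
      return diff % gcd(c, a) == 0;
  }

For the example above, dependence_possible(2, 3, 2, 0) evaluates 3 % 2 == 0 and returns 0: no dependence is possible.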

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependences: Example 2

  for (i = 0; i < 100; i = i + 1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
  }

• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
  for (i = 0; i < 100; i = i + 1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i = 0; i < 100; i = i + 1) {
      T[i] = X[i] / c;   /* Y renamed to T */
      X1[i] = X[i] + c;  /* X renamed to X1 */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }

Eliminating Recurrence Dependence

• Recurrence example: a dot product
  for (i = 9999; i >= 0; i = i - 1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
  for (i = 9999; i >= 0; i = i - 1)
      sum[i] = x[i] * y[i];
  /* scalar expansion: scalar ==> vector; this loop is parallel */
  for (i = 9999; i >= 0; i = i - 1)
      finalsum = finalsum + sum[i];
  /* a reduction: sums up the elements of the vector; not parallel */

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors (p ranges from 0 to 9):
  for (i = 999; i >= 0; i = i - 1)
      finalsum[p] = finalsum[p] + sum[i + 1000*p];

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

Page 2: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Introduction

bull SIMD architectures can exploit significant data‐level parallelism forndash matrix‐oriented scientific computingndash media‐oriented image and sound processors

bull SIMD is more energy efficient than MIMDndash Only needs to fetch one instruction per data operationndash Makes SIMD attractive for personal mobile devices

bull SIMD allows programmer to continue to think sequentially

CA-Lec8 cwliutwinseenctuedutw 2

Introduction

SIMD Variations

bull Vector architecturesbull SIMD extensions

ndash MMX multimedia extensions (1996)ndash SSE streaming SIMD extensionsndash AVX advanced vector extensions

bull Graphics Processor Units (GPUs)ndash Considered as SIMD accelerators

CA-Lec8 cwliutwinseenctuedutw

Introduction

3

SIMD vs MIMD

bull For x86 processorsbull Expect two additional cores per chip per year

bull SIMD width to double every four years

bull Potential speedup from SIMD to be twice that from MIMD

CA-Lec8 cwliutwinseenctuedutw 4

Vector Architecturesbull Basic idea

ndash Read sets of data elements into ldquovector registersrdquondash Operate on those registersndash Disperse the results back into memory

bull Registers are controlled by compilerndash Register files act as compiler controlled buffersndash Used to hide memory latencyndash Leverage memory bandwidth

bull Vector loadsstores deeply pipelinedndash Pay for memory latency once per vector loadstore

bull Regular loadsstoresndash Pay for memory latency for each vector element

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

5

Vector Supercomputers

bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer

ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound

problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer

bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions

bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units

CA-Lec8 cwliutwinseenctuedutw 6

Cray‐1 (1976)

Single PortMemory

16 banks of 64-bit words+ 8-bit SECDED

80MWsec data loadstore

320MWsec instructionbuffer refill

4 Instruction Buffers

64-bitx16 NIP

LIP

CIP

(A0)

( (Ah) + j k m )

64T Regs

(A0)

( (Ah) + j k m )

64 B Regs

S0S1S2S3S4S5S6S7

A0A1A2A3A4A5A6A7

Si

Tjk

Ai

Bjk

FP Add

FP Mul

FP Recip

Int Add

Int Logic

Int Shift

Pop Cnt

Sj

Si

Sk

Addr Add

Addr Mul

Aj

Ai

Ak

memory bank cycle 50 ns processor cycle 125 ns (80MHz)

V0V1V2V3V4V5V6V7

Vk

Vj

Vi V Mask

V Length64 Element Vector Registers

CA-Lec8 cwliutwinseenctuedutw 7

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic

ADDV v3 v1 v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and Store

LV v1 r1 r2

Base r1 Stride r2 Memory

Vector Register

Vector Programming Model

CA-Lec8 cwliutwinseenctuedutw 8

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor
• Consider:

    for (i = 0; i < 64; i=i+1)
        if (X[i] != 0)
            X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:

    LV       V1,Rx     ;load vector X into V1
    LV       V2,Ry     ;load vector Y
    L.D      F0,#0     ;load FP zero into F0
    SNEVS.D  V1,F0     ;sets VM(i) to 1 if V1(i)!=F0
    SUBVV.D  V1,V1,V2  ;subtract under vector mask
    SV       Rx,V1     ;store the result in X

• GFLOPS rate decreases

Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – population count of the mask vector gives the packed vector length
• Expand performs the inverse operation

[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] into the front of the destination; expand scatters them back to positions 1, 4, 5, 7, leaving the B values in the masked-off positions]

Used for density-time conditionals and also for general selection operations; a C sketch of the semantics follows.
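A minimal C sketch of the two operations' semantics (scalar reference code for illustration, not a vector-ISA implementation; function names are assumptions):

    #include <stddef.h>

    /* Compress: pack elements of src whose mask bit is 1 contiguously
       into dst. Returns the packed length (= population count of mask). */
    size_t compress(const double *src, const int *mask, double *dst, size_t n) {
        size_t len = 0;
        for (size_t i = 0; i < n; i++)
            if (mask[i])
                dst[len++] = src[i];
        return len;
    }

    /* Expand: inverse operation; scatter packed elements back to the
       positions where mask is 1, leaving other dst elements unchanged. */
    void expand(const double *src, const int *mask, double *dst, size_t n) {
        size_t j = 0;
        for (size_t i = 0; i < n; i++)
            if (mask[i])
                dst[i] = src[j++];
    }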

Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167ns
• SRAM cycle time is 15ns
• How many memory banks are needed?

• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. An SRAM access takes 15/2.167 = 6.92 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth!

The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)

Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing (sketched in C after this list):
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases number of programs that vectorize
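A minimal C sketch of the three access patterns (scalar reference code for illustration; array and function names are assumptions):

    /* Unit stride: contiguous elements A[0], A[1], ... */
    double unit_stride_sum(const double *A, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += A[i];
        return sum;
    }

    /* Non-unit (constant) stride: every stride-th element,
       e.g. walking a column of a row-major matrix */
    double strided_sum(const double *A, int n, int stride) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += A[(long)i * stride];
        return sum;
    }

    /* Indexed (gather): addresses come from an index vector K */
    double gather_sum(const double *A, const int *K, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += A[K[i]];
        return sum;
    }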

Interleaved Memory Layout
• Vector processor connected to eight banks of unpipelined DRAM; bank i holds the words whose address mod 8 = i
• Great for unit stride:
  – Contiguous elements are in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The above works well for strides that are relatively prime to 8
  – Bad for strides of 2, 4

Handling Multidimensional Arrays in Vector Architectures
• Consider:

    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }

• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help non-unit-stride data

How to get full bandwidth for unit stride?
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – can read a paper on how to do a prime number of banks efficiently

Memory Bank Conflicts

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column accesses conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding", see the sketch below)
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank
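A minimal sketch of the software fix (illustrative; the padding amount of one element is an assumption):

    /* Row length padded from 512 to 513 words so that successive elements
       of a column no longer map to the same subset of the 128 banks.
       The extra column is never used. */
    int x[256][512 + 1];

    void scale_columns(void) {
        for (int j = 0; j < 512; j = j + 1)
            for (int i = 0; i < 256; i = i + 1)
                x[i][j] = 2 * x[i][j];
    }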

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time; successive strided accesses revisit a bank every #banks / GCD(stride, #banks) cycles, so a stall occurs when
  – #banks / GCD(stride, #banks) < bank busy time
  (checked in the sketch below)
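A small C sketch of this conflict condition (illustrative; assumes one access per cycle and cycles as the time unit):

    #include <stdio.h>

    static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

    /* Successive strided accesses revisit the same bank every
       banks/gcd(stride, banks) cycles; a stall occurs if that revisit
       interval is shorter than the bank busy time. */
    static int has_bank_conflict(long stride, long banks, long busy_time) {
        long revisit = banks / gcd(stride, banks);
        return revisit < busy_time;
    }

    int main(void) {
        /* The example on the next slide: 8 banks, busy time 6 cycles */
        printf("stride 1:  conflict=%d\n", has_bank_conflict(1, 8, 6));  /* 0 */
        printf("stride 32: conflict=%d\n", has_bank_conflict(32, 8, 6)); /* 1 */
        return 0;
    }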

Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12+64 = 76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride
     – every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
     – the total time will be 12+1+6×63 = 391 cycles, i.e. 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered

Scatter-Gather
• Consider:

    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk and Rm contain the starting addresses of the vectors
• Use index vectors:

    LV       Vk, Rk       ;load K
    LVI      Va, (Ra+Vk)  ;load A[K[]]
    LV       Vm, Rm       ;load M
    LVI      Vc, (Rc+Vm)  ;load C[M[]]
    ADDVV.D  Va, Va, Vc   ;add them
    SVI      (Ra+Vk), Va  ;store A[K[]]

• A and C must have the same number of non-zero elements (sizes of K and M)
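A scalar C sketch of what the indexed load (gather) and indexed store (scatter) primitives do (illustrative reference code, not the vector ISA itself):

    /* Gather: dst[i] = base[index[i]] -- what LVI Va,(Ra+Vk) performs */
    void gather(double *dst, const double *base, const int *index, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = base[index[i]];
    }

    /* Scatter: base[index[i]] = src[i] -- what SVI (Ra+Vk),Va performs */
    void scatter(double *base, const int *index, const double *src, int n) {
        for (int i = 0; i < n; i++)
            base[index[i]] = src[i];
    }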

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then it is simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [table not reproduced here]

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses; a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture

Example SIMD Code
• Example: DXPY

    L.D     F0,a          ;load scalar a
    MOV     F1, F0        ;copy a into F1 for SIMD MUL
    MOV     F2, F0        ;copy a into F2 for SIMD MUL
    MOV     F3, F0        ;copy a into F3 for SIMD MUL
    DADDIU  R4,Rx,#512    ;last address to load
Loop:
    L.4D    F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
    MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
    L.4D    F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
    ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
    S.4D    0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
    DADDIU  Rx,Rx,#32     ;increment index to X
    DADDIU  Ry,Ry,#32     ;increment index to Y
    DSUBU   R20,R4,Rx     ;compute bound
    BNEZ    R20,Loop      ;check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

[Figure: roofline plots for example machines, not reproduced here]
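The roofline itself is just the minimum of two bounds; a small C sketch (parameter names are illustrative):

    /* Attainable GFLOPs/sec = min(peak memory BW x arithmetic intensity,
                                   peak floating-point performance) */
    double attainable_gflops(double peak_mem_bw_gbs,  /* GB/s */
                             double arith_intensity,  /* FLOPs per byte */
                             double peak_fp_gflops) {
        double memory_bound = peak_mem_bw_gbs * arith_intensity;
        return memory_bound < peak_fp_gflops ? memory_bound : peak_fp_gflops;
    }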

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as CUDA threads
  – Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU
[Figure: GTX570 memory hierarchy; 1280MB global memory and 640KB L2 cache shared by the chip; each of the 15 SMs (SM 0 ... SM 14) has 8KB texture cache, 8KB constant cache, 48KB shared memory, 16KB L1 cache, 32,768 registers, 32 cores, and supports up to 1536 threads/SM]

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication
[Figure not reproduced here]

Programming the GPU
[Figure not reproduced here]

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
• A minimal kernel sketch follows
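A minimal CUDA sketch of the one-thread-per-cell idea (illustrative only; no shared-memory tiling, and names such as matmul are assumptions):

    // Each thread computes one element C[row][col] of C = A * B (N x N).
    __global__ void matmul(const float *A, const float *B, float *C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Launch: one thread per cell, e.g. 16x16 threads per block for N = 4096:
    //   dim3 block(16, 16);
    //   dim3 grid(N / 16, N / 16);
    //   matmul<<<grid, block>>>(dA, dB, dC, N);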

Programming the GPU
• Distinguishing execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to the GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example
• DAXPY in C:

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);

    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

• DAXPY in CUDA:

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    __device__
    void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64  RD0, [X+R8]     ; RD0 = X[i]
    setp.neq.s32   P1, RD0, #0     ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push         ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
    ld.global.f64  RD2, [Y+R8]     ; RD2 = Y[i]
    sub.f64        RD0, RD0, RD2   ; Difference in RD0
    st.global.f64  [X+R8], RD0     ; X[i] = RD0
    @P1, bra ENDIF1, *Comp         ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64  RD0, [Z+R8]     ; RD0 = Z[i]
    st.global.f64  [X+R8], RD0     ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop       ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Processor
[Figure not reproduced here]

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:

    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];  /* S2 */
    }

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:

    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];      /* S2 */
        A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis + data dependence analysis, can be applied within the same basic block to optimize

Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:

    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding dependencies: Example
• Assume:
  – Load an array element with index c × i + d and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

Example

    for (i=0; i<100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |d-b|=3
  3. Since 2 does not divide 3, no dependence is possible
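A small C sketch of the test (illustrative; it reports whether a dependence is possible, not whether one actually exists):

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /* GCD test for a store to X[a*i+b] and a load from X[c*i+d]:
       a loop-carried dependence requires that GCD(c,a) divide (d-b).
       The test can only rule dependence out; passing it does not
       guarantee a dependence exists (loop bounds are ignored). */
    int dependence_possible(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;
    }

    /* Slide's example: a=2, b=3, c=2, d=0 -> GCD(c,a)=2, d-b=-3;
       2 does not divide 3, so dependence_possible(2,3,2,0) == 0. */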

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding dependencies: Example 2
• Example 2:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;  /* S1 */
        X[i] = X[i] + c;  /* S2 */
        Z[i] = Y[i] + c;  /* S3 */
        Y[i] = c - Y[i];  /* S4 */
    }

• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }

• After:

    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

Eliminating Recurrence Dependence
• Recurrence example: a dot product

    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];   /* scalar expansion: scalar ==> vector; parallel */

    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];   /* reduction: sums up the elements of the vector; not parallel */

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:

    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9 (see the sketch below)
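A minimal C sketch of the two-step pattern (illustrative; serial code standing in for the ten per-processor partial sums):

    #define N_PER_PROC 1000
    #define NPROCS     10

    /* Step 1: each "processor" p sums its own 1000-element slice.
       The ten p-iterations are independent, so they can run in parallel. */
    void partial_sums(const double *sum, double *finalsum) {
        for (int p = 0; p < NPROCS; p++) {
            finalsum[p] = 0.0;
            for (int i = N_PER_PROC - 1; i >= 0; i--)
                finalsum[p] += sum[i + N_PER_PROC * p];
        }
    }

    /* Step 2: combine the ten partial sums (a small serial reduction). */
    double combine(const double *finalsum) {
        double total = 0.0;
        for (int p = 0; p < NPROCS; p++)
            total += finalsum[p];
        return total;
    }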

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

Page 3: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

SIMD Variations

bull Vector architecturesbull SIMD extensions

ndash MMX multimedia extensions (1996)ndash SSE streaming SIMD extensionsndash AVX advanced vector extensions

bull Graphics Processor Units (GPUs)ndash Considered as SIMD accelerators

CA-Lec8 cwliutwinseenctuedutw

Introduction

3

SIMD vs MIMD

bull For x86 processorsbull Expect two additional cores per chip per year

bull SIMD width to double every four years

bull Potential speedup from SIMD to be twice that from MIMD

CA-Lec8 cwliutwinseenctuedutw 4

Vector Architecturesbull Basic idea

ndash Read sets of data elements into ldquovector registersrdquondash Operate on those registersndash Disperse the results back into memory

bull Registers are controlled by compilerndash Register files act as compiler controlled buffersndash Used to hide memory latencyndash Leverage memory bandwidth

bull Vector loadsstores deeply pipelinedndash Pay for memory latency once per vector loadstore

bull Regular loadsstoresndash Pay for memory latency for each vector element

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

5

Vector Supercomputers

bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer

ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound

problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer

bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions

bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units

CA-Lec8 cwliutwinseenctuedutw 6

Cray‐1 (1976)

Single PortMemory

16 banks of 64-bit words+ 8-bit SECDED

80MWsec data loadstore

320MWsec instructionbuffer refill

4 Instruction Buffers

64-bitx16 NIP

LIP

CIP

(A0)

( (Ah) + j k m )

64T Regs

(A0)

( (Ah) + j k m )

64 B Regs

S0S1S2S3S4S5S6S7

A0A1A2A3A4A5A6A7

Si

Tjk

Ai

Bjk

FP Add

FP Mul

FP Recip

Int Add

Int Logic

Int Shift

Pop Cnt

Sj

Si

Sk

Addr Add

Addr Mul

Aj

Ai

Ak

memory bank cycle 50 ns processor cycle 125 ns (80MHz)

V0V1V2V3V4V5V6V7

Vk

Vj

Vi V Mask

V Length64 Element Vector Registers

CA-Lec8 cwliutwinseenctuedutw 7

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic

ADDV v3 v1 v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and Store

LV v1 r1 r2

Base r1 Stride r2 Memory

Vector Register

Vector Programming Model

CA-Lec8 cwliutwinseenctuedutw 8

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

• CUDA's design goals
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences
  – No scalar processor / scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor


Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?


Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 8192/512 = 16 blocks (launch configuration sketched below)
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors
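
In CUDA launch terms, this configuration would look roughly like the following (kernel name hypothetical):

    /* 512 threads per block  =>  8192 / 512 = 16 thread blocks in the grid */
    dim3 dimBlock(512);
    dim3 dimGrid(8192 / 512);
    /* vecmul<<<dimGrid, dimBlock>>>(a, b, c);  -- one thread per element */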



Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors


Example

• NVIDIA GPU has 32768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers


GTX570 GPU (organization)
• Global Memory: 1280MB; L2 Cache: 640KB (shared by all SMs)
• 15 SMs (SM 0 … SM 14), each with: 32 cores, 32768 registers, 48KB Shared Memory, 16KB L1 Cache, 8KB Texture Cache, 8KB Constant Cache
• Up to 1536 threads/SM

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively: memory access optimization, latency hiding

Matrix Multiplication


Programming the GPU


Matrix Multiplication
• For a 4096×4096 matrix multiplication
  – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the kernel sketch below)
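
A minimal sketch of such a tiled kernel (tile size and names are illustrative; assumes the matrix dimension N is a multiple of TILE):

    #define TILE 16

    /* Each thread computes one element of C; TILE x TILE sub-matrices of
       A and B are staged through shared memory to buffer global-memory
       traffic. Launch with dim3 grid(N/TILE, N/TILE), block(TILE, TILE). */
    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < N / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                       /* tile fully loaded */
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                       /* done with this tile */
        }
        C[row * N + col] = sum;                    /* one cell per thread */
    }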


Programming the GPU

• Distinguishing execution place of functions:
  – __device__ or __global__ => GPU (Device); variables declared there are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<<...>>> must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0        ; P1 is predicate register 1
@!P1 bra       ELSE1, *Push       ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64        RD0, RD0, RD2      ; Difference in RD0
st.global.f64  [X+R8], RD0        ; X[i] = RD0
@P1 bra        ENDIF1, *Comp      ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
st.global.f64  [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc


Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop‐Level Parallelism

• Example 2:

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 each use a value computed by themselves (A[i], B[i]) in the previous iteration
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop‐Level Parallelism (1)

• Example 3:

for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop‐Level Parallelism (2)

• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];     /* S2 */
    A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop‐Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i=1; i<100; i=i+1) {
    Y[i] = Y[i-1] + Y[i];
}

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding dependencies: Example
• Assume
  – Load an array element with index c × i + d and store to a × i + b
  – i runs from m to n
• Dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b) (sketched in code below)
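
A small sketch of the test in C (helper names are illustrative):

    #include <stdlib.h>

    /* Euclid's algorithm */
    static int gcd(int x, int y) { return y == 0 ? abs(x) : gcd(y, x % y); }

    /* GCD test for a store to a*i + b and a load from c*i + d:
       a loop-carried dependence can exist only if gcd(a, c) divides d - b.
       Returns 1 if a dependence is possible, 0 if it is ruled out. */
    int gcd_may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(a, c) == 0;
    }

    /* For the example on the next slide: gcd_may_depend(2, 3, 2, 0) == 0 */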


Example

for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |d−b|=3
  3. Since 2 does not divide 3, no dependence is possible

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding dependencies
• Example 2:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}

• True dependences: S1->S3 and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:

for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;    /* Y renamed to T */
    X1[i] = X[i] + c;   /* X renamed to X1 */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence

• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to…

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
/* This is called scalar expansion (scalar ==> vector): parallel */

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
/* This is called a reduction: sums up the elements of the vector; not parallel */

This is called a reductionSums up the elements of the vectorNot parallel

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9 (see the sketch below)
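
A plain-C sketch of the whole two-level scheme (ten hypothetical processors; in the real case each p-iteration would run on its own processor):

    /* Level 1: each "processor" p sums its own 1000-element slice. */
    double finalsum[10] = {0.0};
    for (int p = 0; p < 10; p++)
        for (int i = 999; i >= 0; i = i - 1)
            finalsum[p] = finalsum[p] + sum[i + 1000*p];

    /* Level 2: combine the ten partial sums serially. */
    double total = 0.0;
    for (int p = 0; p < 10; p++)
        total = total + finalsum[p];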


Multithreading and Vector Summary

• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth



bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 6: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Supercomputers

bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer

ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound

problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer

bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions

bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units

CA-Lec8 cwliutwinseenctuedutw 6

Cray‐1 (1976)

Single PortMemory

16 banks of 64-bit words+ 8-bit SECDED

80MWsec data loadstore

320MWsec instructionbuffer refill

4 Instruction Buffers

64-bitx16 NIP

LIP

CIP

(A0)

( (Ah) + j k m )

64T Regs

(A0)

( (Ah) + j k m )

64 B Regs

S0S1S2S3S4S5S6S7

A0A1A2A3A4A5A6A7

Si

Tjk

Ai

Bjk

FP Add

FP Mul

FP Recip

Int Add

Int Logic

Int Shift

Pop Cnt

Sj

Si

Sk

Addr Add

Addr Mul

Aj

Ai

Ak

memory bank cycle 50 ns processor cycle 125 ns (80MHz)

V0V1V2V3V4V5V6V7

Vk

Vj

Vi V Mask

V Length64 Element Vector Registers

CA-Lec8 cwliutwinseenctuedutw 7

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic

ADDV v3 v1 v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and Store

LV v1 r1 r2

Base r1 Stride r2 Memory

Vector Register

Vector Programming Model

CA-Lec8 cwliutwinseenctuedutw 8

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it has logically adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit stride data
  – Blocking techniques help non-unit stride data
(a short address-arithmetic sketch follows below)

CA-Lec8 cwliutwinseenctuedutw 41
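As a quick check of the stride numbers above (an added illustration, not from the slides), the element-to-element distance of the column walk through D falls out of plain address arithmetic:

  #include <stdio.h>

  int main(void) {
      static double D[100][100];
      /* Walking down a column (D[k][j] -> D[k+1][j]) jumps one full
         row: 100 double words = 800 bytes, i.e., a stride of 100. */
      long stride_bytes = (char *)&D[1][0] - (char *)&D[0][0];
      printf("stride = %ld elements (%ld bytes)\n",
             stride_bytes / (long)sizeof(double), stride_bytes);
      return 0;
  }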

How to get full bandwidth for unit stride?
• Memory system must sustain (# lanes × word)/clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, the column walk in the code below hits the same bank on every word access, causing conflicts
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks (see the C sketch below)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – but a modulo and a divide per memory access, with a prime number of banks?
  – address within bank = address mod number of words in bank (avoids the divide)
  – bank number: easy if 2^N words per bank

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

CA-Lec8 cwliutwinseenctuedutw 43
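To make the two HW mappings concrete, a small C sketch (added for illustration; the bank counts and words-per-bank value are assumptions) computing the bank number and in-bank address both ways:

  #include <stdio.h>

  int main(void) {
      unsigned addr = 12345;          /* word address, arbitrary example */
      unsigned banks = 128;           /* power-of-2 bank count */
      unsigned words_per_bank = 4096; /* power of 2, so mod is cheap */

      /* Straightforward interleaving: a divide is needed for the offset */
      unsigned bank   = addr % banks;
      unsigned offset = addr / banks;

      /* Prime number of banks (e.g., 127): take the in-bank address
         as addr mod words_per_bank instead, avoiding the divide */
      unsigned pbank   = addr % 127;
      unsigned poffset = addr % words_per_bank;  /* addr & (words_per_bank-1) */

      printf("interleaved: bank %u, offset %u\n", bank, offset);
      printf("prime banks: bank %u, offset %u\n", pbank, poffset);
      return 0;
  }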

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Example
• 8 memory banks. Bank busy time: 6 cycles. Total memory latency: 12 cycles for the initial access.
• What is the difference between a 64-element vector load with a stride of 1 and a stride of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – the load will take 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
(see the timing sketch below)

CA-Lec8 cwliutwinseenctuedutw 45
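A small C sketch of the cycle counts above (added illustration; it assumes the simple model used by the example, where a colliding access waits exactly the bank busy time):

  #include <stdio.h>

  /* Cycles for an n-element vector load: the first element pays the
     full memory latency; each later element takes 1 cycle, or the
     bank busy time if it hits a still-busy bank (stride a multiple
     of the bank count). */
  int load_cycles(int n, int stride, int banks, int busy, int latency) {
      int collides = (stride % banks == 0);
      return collides ? latency + 1 + busy * (n - 1)
                      : latency + n;
  }

  int main(void) {
      printf("stride  1: %d cycles\n", load_cycles(64,  1, 8, 6, 12)); /* 76  */
      printf("stride 32: %d cycles\n", load_cycles(64, 32, 8, 6, 12)); /* 391 */
      return 0;
  }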

Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk       ; load K
LVI     Va, (Ra+Vk)  ; load A[K[]]
LV      Vm, Rm       ; load M
LVI     Vc, (Rc+Vm)  ; load C[M[]]
ADDVV.D Va, Va, Vc   ; add them
SVI     (Ra+Vk), Va  ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
(a scalar C equivalent of the gather/scatter steps follows below)

CA-Lec8 cwliutwinseenctuedutw 47
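For reference, a scalar C sketch of the same gather / add / scatter sequence (added as an illustration; the 64-element buffers standing in for vector registers assume n ≤ 64):

  void sparse_sum(int n, double *A, const double *C,
                  const int *K, const int *M) {
      double va[64], vc[64];                       /* "vector registers" */
      for (int i = 0; i < n; i++) va[i] = A[K[i]]; /* gather A[K[]]      */
      for (int i = 0; i < n; i++) vc[i] = C[M[i]]; /* gather C[M[]]      */
      for (int i = 0; i < n; i++) va[i] += vc[i];  /* vector add         */
      for (int i = 0; i < n; i++) A[K[i]] = va[i]; /* scatter back       */
  }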

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
  – use in drivers or added to library routines; no compiler

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Uses much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Code
• Example: DAXPY
L.D     F0,a         ; load scalar a
MOV     F1,F0        ; copy a into F1 for SIMD MUL
MOV     F2,F0        ; copy a into F2 for SIMD MUL
MOV     F3,F0        ; copy a into F3 for SIMD MUL
DADDIU  R4,Rx,#512   ; last address to load
Loop:
L.4D    F4,0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D  F4,F4,F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D    F8,0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D  F8,F8,F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D    0[Ry],F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU  Rx,Rx,#32    ; increment index to X
DADDIU  Ry,Ry,#32    ; increment index to Y
DSUBU   R20,R4,Rx    ; compute bound
BNEZ    R20,Loop     ; check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
(a small C sketch of this bound follows below)

CA-Lec8 cwliutwinseenctuedutw 59
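A minimal sketch of the roofline bound (added illustration; the peak bandwidth and peak FLOP rate below are made-up placeholders, not measurements of any particular machine):

  #include <stdio.h>

  /* Attainable GFLOP/s is capped by the memory roof (BW x intensity)
     or by the compute roof, whichever is lower. */
  double attainable_gflops(double peak_bw_gbs, double ai, double peak_gflops) {
      double mem_roof = peak_bw_gbs * ai;
      return mem_roof < peak_gflops ? mem_roof : peak_gflops;
  }

  int main(void) {
      /* hypothetical machine: 16 GB/s memory, 4 GFLOP/s peak */
      for (double ai = 0.0625; ai <= 4.0; ai *= 2)
          printf("AI = %.4f FLOP/byte -> %.2f GFLOP/s\n",
                 ai, attainable_gflops(16.0, ai, 4.0));
      return 0;
  }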

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

CA-Lec8 cwliutwinseenctuedutw 61


Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63


Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

CA-Lec8 cwliutwinseenctuedutw 64


Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
(a CUDA-style sketch of the launch arithmetic follows below)

CA-Lec8 cwliutwinseenctuedutw 65
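A CUDA-style sketch of the launch arithmetic for this example (added illustration; the kernel name and signature are placeholders, not from the slides):

  // 8192 elements / 512 threads per block = 16 blocks in the grid.
  __global__ void vmul(int n, const double *a, const double *b, double *c) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
      if (i < n) c[i] = a[i] * b[i];
  }

  void launch(int n, const double *a, const double *b, double *c) {
      int threads = 512;
      int blocks  = (n + threads - 1) / threads;      // = 16 for n = 8192
      vmul<<<blocks, threads>>>(n, a, b, c);
  }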


Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU
[Figure: GTX570 memory hierarchy. Off-chip global memory: 1280MB; on-chip L2 cache: 640KB. Each SM (SM 0 ... SM 14) has 32,768 registers, 32 cores, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache, and runs up to 1536 threads/SM.]

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication
• For a 4096×4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
(a one-thread-per-cell kernel sketch follows below)

CA-Lec8 cwliutwinseenctuedutw 72
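A minimal one-thread-per-cell kernel sketch of the idea above (added illustration; it omits the shared-memory tiling the slide mentions, and all names are placeholders):

  // Each thread computes one element C[row][col] of C = A * B for
  // square N x N matrices stored row-major in global memory.
  __global__ void matmul(int N, const float *A, const float *B, float *C) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < N && col < N) {
          float sum = 0.0f;
          for (int k = 0; k < N; k++)
              sum += A[row * N + k] * B[k * N + col];
          C[row * N + col] = sum;
      }
  }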

Programming the GPU
• Distinguishing execution place of functions:
  – __device__ or __global__ => GPU Device
    • Variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(...parameter list...)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2       ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0   ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same: there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to the same array at index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. The store to index a × j + b and the load from index c × k + d touch the same location: a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) divides (d − b)
(a small C checker follows below)

CA-Lec8 cwliutwinseenctuedutw 89
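A tiny C sketch of the GCD screening test (added illustration; it implements only the necessary condition stated above, not full dependence analysis):

  #include <stdio.h>

  int gcd(int a, int b) {                  /* Euclid's algorithm */
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* Store index a*i+b, load index c*i+d: a loop-carried dependence
     is possible only if gcd(c, a) divides (d - b). */
  int may_depend(int a, int b, int c, int d) {
      int g    = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
      int diff = d - b;
      if (diff < 0) diff = -diff;
      return g == 0 ? diff == 0 : diff % g == 0;
  }

  int main(void) {
      /* next slide's example: X[2i+3] = X[2i] * 5.0 -> a=2,b=3,c=2,d=0 */
      printf("dependence possible? %s\n",
             may_depend(2, 3, 2, 0) ? "yes" : "no");  /* prints "no" */
      return 0;
  }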

Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d − b| = 3
  3. Since 2 does not divide 3, no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i]  = X[i] / c;
    X1[i] = X[i] + c;
    Z[i]  = T[i] + c;
    Y[i]  = c - T[i];
}

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
  – this is called scalar expansion (scalar ===> vector); parallel
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
  – this is called a reduction; it sums up the elements of the vector; not parallel

CA-Lec8 cwliutwinseenctuedutw 94

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
  – Assume p ranges from 0 to 9
(a plain-C sketch of this two-phase reduction follows below)

CA-Lec8 cwliutwinseenctuedutw 95
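A plain-C sketch of the two-phase reduction (added illustration; the serial outer loop over p stands in for the ten processors, which would each run their slice in parallel):

  /* Phase 1: each "processor" p sums its own 1000-element slice of
     sum[]; the ten p-iterations are independent. */
  double reduce_10000(const double *sum) {     /* sum has 10000 elements */
      double finalsum[10] = {0.0};
      for (int p = 0; p < 10; p++)
          for (int i = 999; i >= 0; i--)
              finalsum[p] += sum[i + 1000 * p];

      /* Phase 2: combine the ten partial sums (short serial tail). */
      double total = 0.0;
      for (int p = 0; p < 10; p++)
          total += finalsum[p];
      return total;
  }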

Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step up in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 7: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Cray‐1 (1976)

Single PortMemory

16 banks of 64-bit words+ 8-bit SECDED

80MWsec data loadstore

320MWsec instructionbuffer refill

4 Instruction Buffers

64-bitx16 NIP

LIP

CIP

(A0)

( (Ah) + j k m )

64T Regs

(A0)

( (Ah) + j k m )

64 B Regs

S0S1S2S3S4S5S6S7

A0A1A2A3A4A5A6A7

Si

Tjk

Ai

Bjk

FP Add

FP Mul

FP Recip

Int Add

Int Logic

Int Shift

Pop Cnt

Sj

Si

Sk

Addr Add

Addr Mul

Aj

Ai

Ak

memory bank cycle 50 ns processor cycle 125 ns (80MHz)

V0V1V2V3V4V5V6V7

Vk

Vj

Vi V Mask

V Length64 Element Vector Registers

CA-Lec8 cwliutwinseenctuedutw 7

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic

ADDV v3 v1 v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and Store

LV v1 r1 r2

Base r1 Stride r2 Memory

Vector Register

Vector Programming Model

CA-Lec8 cwliutwinseenctuedutw 8

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer


Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2       ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
st.global.f64 [X+R8], RD0         ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask


NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by the SIMD processors is GPU memory
  – Host can read and write GPU memory
  – (The CUDA view of these three levels is sketched below)

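In CUDA source these three levels appear as per-thread locals, __shared__ arrays, and global memory. A small sketch (our illustration; all names are ours, and it assumes blockDim.x == 256):

__device__ float table[1024];         // GPU memory: visible to all SIMD
                                      // processors; host can read/write it
__global__ void levels(float *out)
{
    float priv;                       // private memory: per thread (per SIMD
                                      // lane); spilled registers land here
    __shared__ float buf[256];        // local memory: shared by the threads
                                      // of one block on one SIMD processor
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    buf[threadIdx.x] = table[i % 1024];
    __syncthreads();                  // make the shared buffer visible
    priv = buf[(threadIdx.x + 1) % 256];
    out[i] = priv;
}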

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions


Fermi Multithreaded SIMD Proc.

(figure: block diagram of a Fermi multithreaded SIMD processor)


Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

• No loop-carried dependence


Example 2 for Loop-Level Parallelism

• Example 2:

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 each use a value (A[i], B[i]) that they themselves computed in the previous iteration
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence


Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements


Example 3 for Loop-Level Parallelism (1)

• Example 3:

for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2


Example 3 for Loop-Level Parallelism (2)

• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried


More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize (the sketch below shows the effect)

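At source level, the effect of that analysis is to keep the value live in a register, roughly as below (our illustration; the element type double is an assumption):

/* After analysis: the value of A[i] stays live in a register,
   so the second reference needs no load instruction. */
for (i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];
    A[i] = t;            /* one store */
    D[i] = t * E[i];     /* register read, not a reload of A[i] */
}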

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP


Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop


Finding Dependencies: Example

• Assume:
  – Load from an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist iteration indices j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)


Example

for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2; |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible

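The test is mechanical enough to sketch in a few lines of C (ours, not from the lecture; it assumes the affine coefficients a and c are nonzero):

/* Euclid's algorithm. */
int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* GCD test for a store to X[a*i+b] and a load from X[c*i+d].
   Returns 0 when a loop-carried dependence is impossible,
   1 when a dependence is possible (the test is only sufficient). */
int may_depend(int a, int b, int c, int d)
{
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % g == 0;
}

/* X[2*i+3] = X[2*i] * 5.0  =>  may_depend(2, 3, 2, 0) == 0: no dependence. */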

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists (see the example below)
• Determining whether a dependence actually exists is NP-complete

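For instance (our example, not from the lecture), take a = 1, b = 10, c = 1, d = 0: GCD(1, 1) = 1 divides |d − b| = 10, so the test reports a possible dependence, yet the loop bounds keep the accessed ranges disjoint:

/* Stores touch X[10]..X[19]; loads touch X[0]..X[9].
   The GCD test says "possible", but the two index ranges
   never overlap, so no dependence actually exists. */
for (i = 0; i < 10; i = i + 1)
    X[i + 10] = X[i] + 1.0;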

Finding Dependencies: Example 2

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])


Renaming to Eliminate False (Pseudo) Dependences

• Before:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After (Y renamed to T, and the stored X renamed to X1):

for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}


Eliminating Recurrence Dependence

• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

  – This is called scalar expansion: scalar ===> vector; parallel

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

  – This is called a reduction: it sums up the elements of the vector; not parallel

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1,000 elements on each of ten processors:

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9 (a complete two-phase sketch follows below)

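Putting the two phases together (our sketch; the outer loop over p is written sequentially here, but each p iteration is independent and can run on its own processor):

#define P 10                 /* ten processors, 1000 elements each */

double sum[10000];           /* assume already filled with the products */
double finalsum[P], total = 0.0;

void reduce(void)
{
    /* Phase 1: each processor p reduces its own 1000-element slice. */
    for (int p = 0; p < P; p++) {        /* parallel across p */
        finalsum[p] = 0.0;
        for (int i = 999; i >= 0; i = i - 1)
            finalsum[p] = finalsum[p] + sum[i + 1000*p];
    }

    /* Phase 2: combine the P partial sums (serial, but only P steps). */
    for (int p = 0; p < P; p++)
        total = total + finalsum[p];
}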

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth


Page 8: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

+ + + + + +

[0] [1] [VLR-1]

Vector Arithmetic

ADDV v3 v1 v2 v3

v2v1

Scalar Registers

r0

r15Vector Registers

v0

v15

[0] [1] [2] [VLRMAX-1]

VLRVector Length Register

v1Vector Load and Store

LV v1 r1 r2

Base r1 Stride r2 Memory

Vector Register

Vector Programming Model

CA-Lec8 cwliutwinseenctuedutw 8

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 9: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example VMIPS

CA-Lec8 cwliutwinseenctuedutw 9

bull Loosely based on Cray‐1bull Vector registers

ndash Each register holds a 64‐element 64 bitselement vector

ndash Register file has 16 read ports and 8 write ports

bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected

bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency

bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers

VMIPS Instructionsbull Operate on many elements concurrently

bull Allows use of slow but wide execution unitsbull High performance lower power

bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks

bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision

CA-Lec8 cwliutwinseenctuedutw 10

Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

11

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting a dependent instruction
• With chaining, can start a dependent instruction as soon as the first result appears

[Figure: timeline of Load/Mul followed by Add, serialized without chaining, overlapped in time with chaining]

Chimes

• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For a vector length of n, this requires m × n clock cycles

Example

LV       V1,Rx     ; load vector X
MULVS.D  V2,V1,F0  ; vector-scalar multiply
LV       V3,Ry     ; load vector Y
ADDVV.D  V4,V2,V3  ; add two vectors
SV       Ry,V4     ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles

Vector Strip-mining

Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers, "strip-mining"

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63    ; N mod 64
      MTC1   VLR, R1      ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    ; Multiply by 8
      DADDU  RA, RA, R2   ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      ; Reset full length
      BGTZ   N, loop      ; Any more to do?

[Figure: arrays A, B, C split into an initial remainder piece followed by full 64-element pieces]

Vector Length Register

• Handles loops whose length is not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:

low = 0;
VL = (n % MVL);                         /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {      /* outer loop */
    for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
        Y[i] = a * X[i] + Y[i];         /* main operation */
    low = low + VL;                     /* start of next vector */
    VL = MVL;                           /* reset the length to maximum vector length */
}

Maximum Vector Length (MVL)

• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA

Vector-Mask Control

[Figure: two ways to execute a masked vector operation with mask M = 0,1,0,0,1,1,0,1. Simple implementation: execute all N operations and turn off result writeback according to the mask (write-disable on the write data port). Density-time implementation: scan the mask vector and execute only the elements with non-zero masks.]

Vector Mask Register (VMR)

• A Boolean vector that controls the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking (a C sketch of the simple, write-gating implementation follows)
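A minimal C sketch of the simple implementation (hypothetical helper name; the mask is modeled as a byte array):

    /* SUBVV.D under mask, simple implementation: all vl elements
     * are computed, but the vector-mask register only enables the
     * writeback. */
    void subvv_masked(double *x, const double *y, const char *vm, int vl) {
        for (int i = 0; i < vl; i++) {
            double diff = x[i] - y[i];   /* every element executes */
            if (vm[i]) x[i] = diff;      /* write back only where VM(i) = 1 */
        }
    }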

Vector Mask Registers: Example

• Programs that contain IF statements in loops cannot otherwise be run on a vector processor
• Consider:
  for (i = 0; i < 64; i=i+1)
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:

LV       V1,Rx     ; load vector X into V1
LV       V2,Ry     ; load vector Y
L.D      F0,#0     ; load FP zero into F0
SNEVS.D  V1,F0     ; sets VM(i) to 1 if V1(i)!=F0
SUBVV.D  V1,V1,V2  ; subtract under vector mask
SV       Rx,V1     ; store the result in X

• GFLOPS rate decreases (masked-off elements still take time)

Compress/Expand Operations

• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation

[Figure: with mask M = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] into the front of the destination; expand scatters them back to positions 1, 4, 5, 7, the other positions keeping B's values]

Used for density-time conditionals and also for general selection operations (a C sketch follows).
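A C sketch of both operations (illustrative helper names; mask modeled as a byte array):

    /* Compress: pack the masked elements of a contiguously at the
     * start of dst; the mask's population count is the packed length. */
    int vcompress(double *dst, const double *a, const char *vm, int vl) {
        int len = 0;
        for (int i = 0; i < vl; i++)
            if (vm[i]) dst[len++] = a[i];
        return len;
    }

    /* Expand: inverse operation; masked positions receive the packed
     * values in order, unmasked positions keep b's values. */
    void vexpand(double *dst, const double *packed, const double *b,
                 const char *vm, int vl) {
        int k = 0;
        for (int i = 0; i < vl; i++)
            dst[i] = vm[i] ? packed[k++] : b[i];
    }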

Memory Banks

• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non-sequential loads or stores
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)

• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. One SRAM access takes 15/2.167 ≈ 6.92, i.e. 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks for full memory bandwidth

The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth). The arithmetic is reproduced in the C snippet below.
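A few lines of C checking the bank count (values taken from the slide; the 7 is the SRAM busy time rounded up to whole processor cycles):

    #include <stdio.h>

    int main(void) {
        int refs_per_cycle = 32 * (4 + 2);      /* 192 references/cycle   */
        double busy = 15.0 / 2.167;             /* ~6.92 processor cycles */
        int banks = refs_per_cycle * 7;         /* round busy time up     */
        printf("%d refs/cycle, %.2f cycles busy -> %d banks\n",
               refs_per_cycle, busy, banks);    /* prints 1344 banks      */
        return 0;
    }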

Memory Addressing

• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize

Interleaved Memory Layout

[Figure: a vector processor fed by 8 unpipelined DRAM banks, bank i holding the addresses with Addr mod 8 = i]

• Great for unit stride:
  – Contiguous elements are in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4

Handling Multidimensional Arrays in Vector Architectures

• Consider:
  for (i = 0; i < 100; i=i+1)
      for (j = 0; j < 100; j=j+1) {
          A[i][j] = 0.0;
          for (k = 0; k < 100; k=k+1)
              A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements (see the column-load sketch below)
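For instance, loading one column of D is a strided access. A C sketch of what the hardware gather does (illustrative function; fixed 100x100 row-major layout assumed):

    /* Loading column j of D (row-major 100x100 doubles) into a vector
     * register: consecutive elements sit 100 doublewords apart. */
    void load_column_of_D(double *vreg, const double D[100][100], int j, int vl) {
        for (int k = 0; k < vl; k++)
            vreg[k] = D[k][j];   /* address advances by 100 * sizeof(double) */
    }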

(Unit/Non-Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride (sketched below)
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help with non-unit-stride data
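A C sketch of the LVWS semantics (illustrative function name; stride measured in elements):

    /* LVWS-style load: gather vl elements separated by a constant
     * stride from base into a vector register.  The stride, like the
     * base address, would come from a general-purpose register. */
    void lvws(double *vreg, const double *base, long stride, int vl) {
        for (int i = 0; i < vl; i++)
            vreg[i] = base[i * stride];
    }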

How to get full bandwidth for unit stride?

• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – Can read the paper on how to implement a prime number of banks efficiently

Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
• SW: loop interchange, or declare the array dimension not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if 2^N words per bank

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time

Example

• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles to initialize
• What is the difference between a 64-element vector load with a stride of 1 and one with a stride of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) collides with the previous access and has to wait for the 6-cycle bank busy time
     – the load takes 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element

(The C model below reproduces both counts.)
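A small C model of the slide's assumptions (one request issued per cycle, a bank busy for 6 cycles per access; illustrative code, not from the slides):

    #include <stdio.h>

    #define NBANKS    8
    #define BANK_BUSY 6    /* cycles a bank stays busy per access */
    #define MEM_LAT   12   /* start-up latency of the load        */

    /* Cycles for an n-element strided load: a request issues each
     * cycle unless its target bank is still busy from a prior access. */
    long load_cycles(int n, int stride) {
        long t = 0, busy_until[NBANKS] = {0};
        for (int i = 0; i < n; i++) {
            int bank = (int)((long)i * stride % NBANKS);
            if (t < busy_until[bank]) t = busy_until[bank]; /* conflict stall */
            busy_until[bank] = t + BANK_BUSY;
            t++;                                            /* next issue slot */
        }
        return MEM_LAT + t;
    }

    int main(void) {
        printf("stride 1:  %ld cycles\n", load_cycles(64, 1));   /* 76  */
        printf("stride 32: %ld cycles\n", load_cycles(64, 32));  /* 391 */
        return 0;
    }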

Handling Sparse Matrices in Vector Architectures

• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered)

Scatter-Gather

• Consider:
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk and Rm contain the starting addresses of the vectors
• Use index vectors:

LV       Vk, Rk       ; load K
LVI      Va, (Ra+Vk)  ; load A[K[]]
LV       Vm, Rm       ; load M
LVI      Vc, (Rc+Vm)  ; load C[M[]]
ADDVV.D  Va, Va, Vc   ; add them
SVI      (Ra+Vk), Va  ; store A[K[]]

• A and C must have the same number of non-zero elements (sizes of K and M); a C sketch of the indexed operations follows
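A C sketch of the LVI/SVI semantics (illustrative helper names; indices modeled as longs):

    /* LVI: indexed gather through an index vector. */
    void lvi(double *vreg, const double *base, const long *idx, int vl) {
        for (int i = 0; i < vl; i++)
            vreg[i] = base[idx[i]];      /* gather A[K[i]] */
    }

    /* SVI: indexed scatter back through the same index vector. */
    void svi(double *base, const long *idx, const double *vreg, int vl) {
        for (int i = 0; i < vl; i++)
            base[idx[i]] = vreg[i];      /* scatter to A[K[i]] */
    }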

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable as long as elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures

• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [table of vectorization levels not reproduced]

Multimedia Extensions

• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM

• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

Example SIMD Code

• Example: DXPY
        L.D     F0,a         ; load scalar a
        MOV     F1,F0        ; copy a into F1 for SIMD MUL
        MOV     F2,F0        ; copy a into F2 for SIMD MUL
        MOV     F3,F0        ; copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512   ; last address to load
  Loop: L.4D    F4,0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0     ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D    F8,0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4     ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D    0[Ry],F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32    ; increment index to X
        DADDIU  Ry,Ry,#32    ; increment index to Y
        DSUBU   R20,R4,Rx    ; compute bound
        BNEZ    R20,Loop     ; check if done

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only accesses contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model

• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

Examples

• Attainable GFLOPs/sec = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)

[Figure: rooflines for a memory-bound and a compute-bound machine; below the ridge point performance is memory-bound, above it compute-bound]
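The min() form of the model is easy to state in code. A one-function C sketch (illustrative name and units):

    /* Roofline: attainable throughput is capped by either the compute
     * ceiling or what the memory system can feed at this intensity. */
    double attainable_gflops(double peak_gflops, double peak_bw_gbytes,
                             double flops_per_byte) {
        double memory_bound = peak_bw_gbytes * flops_per_byte;
        return memory_bound < peak_gflops ? memory_bound : peak_gflops;
    }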

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model

• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU

• CUDA programming model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example

• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU

[Figure: global memory 1280MB and L2 cache 640KB shared by all SMs; each SM (SM 0 … SM 14) has 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache; up to 1536 threads per SM]

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication

[Figure: decomposition of matrix multiplication into a grid of thread blocks, not reproduced]

Programming the GPU

[Figure: CUDA source listing for the matrix-multiplication example, not reproduced]

Matrix Multiplication

• For a 4096x4096 matrix multiplication, Matrix C requires the calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a kernel sketch follows)
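A minimal CUDA sketch of this scheme (not the slides' code; TILE and all names are illustrative, and n is assumed to be a multiple of TILE):

    #define TILE 16   // illustrative tile size: 16x16 = 256 threads/block

    // One thread per C cell; A and B tiles staged through shared memory.
    __global__ void matmul(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                       // whole tile loaded
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                       // done with this tile
        }
        C[row * n + col] = acc;                    // one cell per thread
    }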

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread-within-block identifier
  – blockDim: threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (kernels launched with <<<>>> carry the __global__ qualifier)
__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2  ; Difference in RD0
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
        @P1, bra ENDIF1, *Comp        ; complement mask bits
                                      ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop      ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – The host can read and write GPU memory

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Processor

[Figure: block diagram of a Fermi streaming multiprocessor, not reproduced]

Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires that there be no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism

• Example 2:
  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 each use a value they computed in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)

• Example 3:
  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)

• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

Recurrence

• A recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependences: Example

• Assume:
  – A load from an array element with index c × i + d, and a store to one with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m <= j <= n and m <= k <= n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) divides |d-b|

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |d-b|=3
  3. Since 2 does not divide 3, no dependence is possible

(A C sketch of the test follows.)
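A C sketch of the GCD test (illustrative helper names):

    #include <stdio.h>

    int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /* GCD test for a store to a*i + b and a load from c*i + d:
     * a loop-carried dependence is possible only if gcd(c, a)
     * divides |d - b|.  Failing the test rules a dependence out;
     * passing it does not prove one exists. */
    int dependence_possible(int a, int b, int c, int d) {
        int diff = d - b;
        if (diff < 0) diff = -diff;
        return diff % gcd(c, a) == 0;
    }

    int main(void) {
        /* X[2*i+3] = X[2*i] * 5.0: a=2, b=3, c=2, d=0 */
        printf("%s\n", dependence_possible(2, 3, 2, 0)
                       ? "dependence possible" : "no dependence");
        return 0;
    }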

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependences: Example 2

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}

• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }

Eliminating Recurrence Dependence

• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];          /* scalar expansion: scalar ==> vector, parallel */
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];  /* reduction: sums the vector's elements, not parallel */

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth


Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

• CUDA's design goals
  – Extend a standard sequential programming language, specifically C/C++
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – A minimalist set of abstractions for expressing parallelism
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files

• Differences
  – No scalar processor; scalar code runs on the host
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63


Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid

• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

CA-Lec8 cwliutwinseenctuedutw 64


Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks (checked in the snippet below)
  – A block is analogous to a strip-mined vector loop with vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65
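
The grid-size arithmetic above can be written out directly; the variable names are assumptions:

    int n = 8192;                 /* vector length */
    int threadsPerBlock = 512;
    int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* = 16 blocks */
    /* Each 512-thread block issues 512/32 = 16 SIMD instructions per step,
       since one SIMD instruction covers 32 elements. */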


Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency

• Thread block scheduler schedules blocks to SIMD processors

• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example

• The NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU (block diagram summary)
• Global Memory: 1280MB; L2 Cache: 640KB
• 15 SMs (SM 0 … SM 14), each with: 32 cores, 32,768 registers, Shared Memory 48KB, L1 Cache 16KB, Texture Cache 8KB, Constant Cache 8KB
• Up to 1536 threads/SM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication
• For a 4096×4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C (see the kernel sketch below)
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72
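
A minimal CUDA kernel for the per-thread view described above; the kernel name, signature, and row-major layout are assumptions, and the shared-memory tiling of sub-matrices mentioned in the last bullet is omitted for brevity:

    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  /* one thread per C cell */
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)                  /* dot product of a row of A */
                acc += A[row * N + k] * B[k * N + col];  /* with a column of B */
            C[row * N + col] = acc;
        }
    }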

Programming the GPU

• Distinguishing where functions execute:
  – __device__ or __global__ => GPU (device); variables declared in them are allocated to GPU memory
  – __host__ => system processor (host)

• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within its block
  – blockDim: number of threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

CA-Lec8 cwliutwinseenctuedutw 74
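
A sketch of the host-side setup such a launch assumes; the CUDA API calls are real, but the variable names and the omitted error checking are simplifications:

    double *x, *y;                                   /* device pointers */
    cudaMalloc((void **)&x, n * sizeof(double));
    cudaMalloc((void **)&y, n * sizeof(double));
    cudaMemcpy(x, host_x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(y, host_y, n * sizeof(double), cudaMemcpyHostToDevice);
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);           /* kernel from the slide above */
    cudaMemcpy(host_y, y, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(x); cudaFree(y);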

NVIDIA Instruction Set Arch.
• “Parallel Thread Execution (PTX)”
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • I.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Acts as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2  ; Difference in RD0
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
        @P1, bra ENDIF1, *Comp        ; Comp: complement mask bits
                                      ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop      ; Pop: restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – “Private memory”
  – Contains stack frame, spilling registers, and private variables

• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block

• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

• Loop-level parallelism has no loop-carried dependence

• Example 1:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop‐Level Parallelism

• Example 2:
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

• S1 and S2 each use a value computed by themselves in an earlier iteration (A[i] by S1, B[i] by S2)
  – Loop-carried dependence

• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence

• Two types of S1–S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1

• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

• Example 3:
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular

• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

• Transform to:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];        /* S2 */
        A[i+1] = A[i+1] + B[i+1];    /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location

• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

• To determine which loops might contain parallelism (“inexact”) and to eliminate name dependences

• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]

• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding Dependences: Example
• Assume:
  – Load an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide |d − b|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |b−d| = 3
  3. Since 2 does not divide 3, no dependence is possible
(A C sketch of the test follows below.)

CA-Lec8 cwliutwinseenctuedutw 90
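
The test is easy to express in C; a minimal sketch, where the function names are assumptions and a and c are assumed not both zero:

    #include <stdlib.h>

    static int gcd(int a, int b)                 /* Euclid's algorithm */
    {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }

    /* Store to a*i + b, load from c*i + d: returns 0 if the GCD test rules
       out a loop-carried dependence, 1 if a dependence remains possible. */
    int gcd_test(int a, int b, int c, int d)
    {
        int g = gcd(abs(c), abs(a));
        return abs(d - b) % g == 0;
    }

For the example above, gcd_test(2, 3, 2, 0) computes g = 2 and |d − b| = 3; since 3 % 2 != 0, it returns 0: no dependence is possible.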

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test reports a possible dependence, but no dependence actually exists

• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding Dependences
• Example 2:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }

• True dependences: S1→S3 and S1→S4 (because of Y[i]), but not loop-carried

• Antidependences: S1→S2 (X[i]), S3→S4 (Y[i])
• Output dependence: S1→S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences

• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }

• After:
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;     /* Y renamed to T */
        X1[i] = X[i] + c;    /* X renamed to X1 */
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example: a dot product

    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to…

    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];

    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];

CA-Lec8 cwliutwinseenctuedutw 94

The first loop is scalar expansion: the scalar becomes a vector, and the loop is parallel.

The second loop is a reduction: it sums up the elements of the vector, and is not parallel.

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment

• Example: to sum up 1000 elements on each of ten processors (see the sketch below)

    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95
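
A sketch of the full two-level scheme in C; the array names and fixed sizes are taken from the example, everything else is an assumption:

    /* Level 1: each processor p sums its own 1000-element slice;
       the iterations over p are conceptually parallel. */
    for (int p = 0; p < 10; p++)
        for (int i = 999; i >= 0; i--)
            finalsum[p] = finalsum[p] + sum[i + 1000*p];

    /* Level 2: a short serial pass combines the ten partial sums. */
    double total = 0.0;
    for (int p = 0; p < 10; p++)
        total = total + finalsum[p];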

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, the hardware is simpler, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 12: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Memory‐Memory vsVector Register Machines

bull Vector memory‐memory instructions hold all vector operands in main memory

bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines

bull Cray‐1 (rsquo76) was first vector register machine

for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]

Example Source Code ADDV C A BSUBV D A B

Vector Memory-Memory Code

LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D

Vector Register Code

CA-Lec8 cwliutwinseenctuedutw 12

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

ADDV C, A, B

[Figure: execution using one pipelined functional unit (one element per cycle: C[0], C[1], C[2] completed while A[3]+B[3] ... A[6]+B[6] are in flight) vs. four‐lane execution using four pipelined functional units (four elements per cycle: C[0]..C[11] completed while A[12]+B[12] ... A[27]+B[27] are in flight)]

Vector Unit Structure

[Figure: a four‐lane vector unit. Each lane holds one slice of the vector registers and one pipeline of each functional unit, and connects to the memory subsystem. Lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, ...]

Multiple Lanes Architecture

• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
  – Allows for multiple hardware lanes
  – No communication between lanes
  – Little increase in control overhead
  – No need to change machine code

Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance

Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: scalar sequential code runs load, load, add, store for iteration 1, then again for iteration 2, strictly serialized in time; vectorized code issues one vector load, one vector load, one vector add, and one vector store that cover iterations 1, 2, ... together]

Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis

Vector Execution Time

• Execution time depends on three factors:
  – Length of operand vectors
  – Structural hazards
  – Data dependencies
• VMIPS functional units consume one element per clock cycle
  – Execution time is approximately the vector length
• Convoy
  – Set of vector instructions that could potentially execute together

Vector Instruction Parallelism

Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes

[Figure: the load unit, multiply unit, and add unit each finish a 32‐element vector in 4 cycles (8 lanes); successive load, mul, and add instructions overlap in time across the three units]

Complete 24 operations/cycle while issuing 1 short instruction/cycle

Convoy

• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining

Vector Chaining

• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available

LV   V1
MULV V3, V1, V2
ADDV V5, V3, V4

[Figure: the load unit fills V1 from memory; V1's elements are chained into the multiplier (V3 = V1 * V2), and the multiplier's results are in turn chained into the adder (V5 = V3 + V4)]

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: timeline comparison. Without chaining, Load, Mul, and Add run back to back; with chaining, Mul starts as soon as Load produces its first element, and Add likewise overlaps Mul]

Chimes

• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For a vector length of n, this requires m x n clock cycles
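As a rough timing sketch (the chime model deliberately ignores start‐up latency): a sequence that partitions into $m$ convoys over vectors of length $n$ takes

$T \approx m \times n$ clock cycles,

so the 3‐convoy example on the next slide costs about $3 \times 64 = 192$ cycles.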

Example

LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add two vectors
SV      Ry, V4      ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 x 3 = 192 clock cycles

Vector Strip‐mining

Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers ("strip‐mining")

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

      ANDI   R1, N, 63    ; N mod 64
      MTC1   VLR, R1      ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    ; Multiply by 8
      DADDU  RA, RA, R2   ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      ; Reset full length
      BGTZ   N, loop      ; Any more to do?

[Figure: arrays A, B, C processed as a short remainder piece followed by full 64‐element pieces]

Vector Length Register

• Handles loops whose length is not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, with strip mining:

low = 0;
VL = (n % MVL);                    /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) { /* outer loop */
  for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
    Y[i] = a * X[i] + Y[i];        /* main operation */
  low = low + VL;                  /* start of next vector */
  VL = MVL;                        /* reset the length to maximum vector length */
}

Maximum Vector Length (MVL)

• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilizes the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA

Vector‐Mask Control

Simple implementation
  – execute all N operations, turn off result writeback according to mask

Density-time implementation
  – scan the mask vector and only execute elements with non-zero masks

[Figure: with mask M = (0,1,0,0,1,1,0,1), the simple implementation streams every pair A[i], B[i] through the pipeline and uses the mask to disable the write data port for masked‐off elements; the density‐time implementation skips masked‐off elements and only sends A[1]+B[1], A[4]+B[4], A[5]+B[5], A[7]+B[7] down the pipeline]

Vector Mask Register (VMR)

• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPU and vector processor spend time on masking

Vector Mask Registers: Example

• Programs that contain IF statements in loops cannot otherwise be run on a vector processor
• Consider:
for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:
LV       V1, Rx     ; load vector X into V1
LV       V2, Ry     ; load vector Y
L.D      F0, #0     ; load FP zero into F0
SNEVS.D  V1, F0     ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D  V1, V1, V2 ; subtract under vector mask
SV       Rx, V1     ; store the result in X

• GFLOPS rate decreases (masked‐off elements still take execution time)

Compress/Expand Operations

• Compress packs non‐masked elements from one vector register contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation

[Figure: with mask M = (0,1,0,0,1,1,0,1), compress packs A[1], A[4], A[5], A[7] into the start of the destination; expand scatters those packed values back to positions 1, 4, 5, 7, while masked‐off positions keep the old B values]

Used for density-time conditionals and also for general selection operations
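A minimal C sketch of the two primitives' semantics (scalar reference code, not a vector implementation; names are illustrative):

/* compress: pack elements of a where mask is set; returns packed length */
int compress(int n, const double *a, const char *mask, double *dst) {
  int k = 0;
  for (int i = 0; i < n; i++)
    if (mask[i]) dst[k++] = a[i];   /* popcount(mask) elements end up packed */
  return k;
}

/* expand: inverse operation, scatter packed values back to masked positions */
void expand(int n, const double *packed, const char *mask, double *dst) {
  int k = 0;
  for (int i = 0; i < n; i++)
    if (mask[i]) dst[i] = packed[k++];  /* unmasked positions keep old dst[i] */
}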

Memory Banks

• The start‐up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start‐ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non‐sequential loads or stores
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)

• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
  1. The maximum number of memory references each cycle is 32 x 6 = 192
  2. One SRAM access takes 15 / 2.167 = 6.92 ≈ 7 processor cycles
  3. It requires 192 x 7 = 1344 memory banks to sustain full memory bandwidth

The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)

Memory Addressing

• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non‐unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather‐scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize

Interleaved Memory Layout

[Figure: a vector processor connected to eight unpipelined DRAM banks; bank i holds the words whose address mod 8 = i]

• Great for unit stride:
  – Contiguous elements are in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non‐unit stride?
  – The above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4

Handling Multidimensional Arrays in Vector Architectures

• Consider:
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }

• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row‐major order, but elements of D in column‐major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non‐Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non‐unit stride for D
  – To access non‐sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general‐purpose register (dynamic)
• Caches inherently deal with unit‐stride data
  – Blocking techniques help non‐unit stride data

How to get full bandwidth for unit stride?

• The memory system must sustain (# lanes x word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (can read a paper on how to do a prime number of banks efficiently)
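A rough sizing rule implied by the bullets above (an approximation that ignores bank conflicts):

$N_{\text{banks}} \;\geq\; \text{memory latency (cycles)} \times \text{words requested per clock}$

For example, with the 12‐cycle vector‐load latency assumed earlier and one word requested per clock, more than 12 banks are already needed to avoid stalls.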

Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
• SW: loop interchange, or declaring the array size not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if 2^N words per bank
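To see the conflict concretely, here is a toy C sketch (NBANKS mirrors the 128 banks above; addresses are in words):

#include <stdio.h>

#define NBANKS 128   /* power-of-2 banks: conflicts for stride 512 */

int main(void) {
    /* walking down a column of int x[256][512]: word addresses step by 512 */
    for (int i = 0; i < 4; i++) {
        unsigned addr = (unsigned)i * 512;   /* &x[i][j] - &x[0][j], in words */
        printf("access %d -> bank %u\n", i, addr % NBANKS);
        /* every access lands in bank 0, since 512 is a multiple of 128 */
    }
    return 0;
}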

Problems of Stride Addressing

• Once we introduce non‐unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time

Example

• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64‐element vector load with a stride of 1 and of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6‐cycle bank busy time
     – the load will take 12 + 1 + 6 x 63 = 391 cycles, i.e. 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures

• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather‐scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered

Scatter‐Gather

• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk       ; load K
LVI     Va, (Ra+Vk)  ; load A[K[]]
LV      Vm, Rm       ; load M
LVI     Vc, (Rc+Vm)  ; load C[M[]]
ADDVV.D Va, Va, Vc   ; add them
SVI     (Ra+Vk), Va  ; store A[K[]]

• A and C must have the same number of non‐zero elements (the sizes of K and M)

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real‐time model than out‐of‐order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures

• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Example: Cray Y‐MP benchmarks

Multimedia Extensions

• Very short vectors added to existing ISAs
• Usually 64‐bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128‐bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit‐stride loads must be aligned to a 64/128‐bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA‐71000LC, UltraSPARC
• 3 data types: 8 8‐bit, 4 16‐bit, 2 32‐bit, in 64 bits
  – reuse the 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8‐bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply‐Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
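To make "parallel sub‐word arithmetic with saturation" concrete, here is a plain‐C reference sketch of a 4x16‐bit unsigned saturating add (a scalar emulation of one such MMX‐style operation, not actual MMX code):

#include <stdint.h>

/* 4 x 16-bit unsigned saturating add across one 64-bit register */
uint64_t padd_us16(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (16 * lane)) & 0xFFFF;
        uint32_t y = (b >> (16 * lane)) & 0xFFFF;
        uint32_t s = x + y;
        if (s > 0xFFFF) s = 0xFFFF;          /* saturate instead of wrapping */
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}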

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurposed the 64‐bit floating‐point registers
  – Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128‐bit registers
  – Eight 16‐bit integer ops; four 32‐bit integer/fp ops or two 64‐bit integer/fp ops
  – Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (2010)
  – 256‐bit registers
  – Four 64‐bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM

• VMX (1996‐1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate, aligned data transfers in memory
  – Vector: a single instruction can make 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

Example SIMD Code

• Example: DXPY
      L.D     F0, a        ; load scalar a
      MOV     F1, F0       ; copy a into F1 for SIMD MUL
      MOV     F2, F0       ; copy a into F2 for SIMD MUL
      MOV     F3, F0       ; copy a into F3 for SIMD MUL
      DADDIU  R4, Rx, #512 ; last address to load
Loop: L.4D    F4, 0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4, F4, F0   ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
      L.4D    F8, 0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8, F8, F4   ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
      S.4D    0[Ry], F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx, Rx, #32  ; increment index to X
      DADDIU  Ry, Ry, #32  ; increment index to Y
      DSUBU   R20, R4, Rx  ; compute bound
      BNEZ    R20, Loop    ; check if done

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only accesses contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128‐256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model

• Basic idea:
  – Plot peak floating‐point throughput as a function of arithmetic intensity
  – Ties together floating‐point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating‐point operations per byte read
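The roofline bound itself (the standard formula of the model, not spelled out on the slide) is:

$\text{Attainable GFLOPs/sec} \;=\; \min\left(\text{Peak Memory BW} \times \text{Arithmetic Intensity},\; \text{Peak Floating-Point Performance}\right)$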

Examples

• Attainable GFLOPs/sec

[Figure: roofline plots for example machines]

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high‐end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C‐like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model

• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data‐level parallel problems
  – Scatter‐gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, no scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU

• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – A block is analogous to a strip‐mined vector loop with vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
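A sketch of what this decomposition looks like at the CUDA call site (the kernel name mulKernel and the pointers x, y, z are illustrative, not from the slides; the grid arithmetic mirrors the numbers above):

// 8192 elements / 512 threads per block = 16 blocks in the grid
int n = 8192;
int threadsPerBlock = 512;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // = 16
mulKernel<<<blocks, threadsPerBlock>>>(n, x, y, z);        // hypothetical kernel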

Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example

• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32‐bit elements
    • 32 vector registers of 32 64‐bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication

[Figure: matrix multiplication decomposition on the GPU]

Programming the GPU

[Figure: CUDA matrix-multiplication source walkthrough]

Matrix Multiplication

• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general‐purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub‐matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
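A minimal CUDA sketch of the one-thread-per-cell idea (a naive version without the shared-memory buffering mentioned above; the kernel name and 2-D indexing are illustrative):

__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // this thread's C row
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's C column
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];    // dot product for one cell
        C[row * n + col] = sum;
    }
}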

Programming the GPU

• Distinguishing the execution place of functions:
  – _device_ or _global_ => GPU device; variables declared there are allocated to the GPU memory
  – _host_ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread-within-block identifier
  – blockDim: threads per block

CUDA Program Example

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i=0; i<n; i++)
    y[i] = a*x[i] + y[i];
}
// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n+255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ... and when paths converge
    • Act as barriers
    • Pops stack
• Per‐thread‐lane 1‐bit predicate register, specified by the programmer

Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra     ELSE1, *Push   ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2  ; Difference in RD0
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
        @P1, bra      ENDIF1, *Comp  ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off‐chip DRAM
  – "Private memory"
  – Contains stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – The host can read and write GPU memory
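In CUDA terms, the per-SIMD-processor local memory is what a kernel declares with __shared__. A small sketch (kernel name illustrative; assumes blockDim.x <= 256):

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];          // local memory: visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];           // stage data in from GPU (global) memory
    __syncthreads();                    // barrier across the block's threads
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int k = 0; k < blockDim.x; k++) s += buf[k];
        out[blockIdx.x] = s;            // one result per block back to global memory
    }
}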

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Processor

[Figure: block diagram of a Fermi multithreaded SIMD processor]

Compiler Technology for Loop‐Level Parallelism

• Loop‐carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop‐level parallelism means there is no loop‐carried dependence

• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;

• No loop‐carried dependence

Example 2 for Loop‐Level Parallelism

• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed by S1 in the previous iteration
  – Loop‐carried dependence
• S2 uses the value computed by S1 in the same iteration
  – Not a loop‐carried dependence

Remarks

• Intra‐loop dependence is not loop‐carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra‐loop dependence
• Two types of S1‐S2 intra‐loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop‐Level Parallelism (1)

• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];    /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */
}

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop‐Level Parallelism (2)

• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop‐carried

More on Loop‐Level Parallelism

for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop‐carried dependence analysis + data dependence analysis, can be applied within the same basic block to optimize

Recurrence

• Recurrence is a special form of loop‐carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one‐dimensional array index is affine if it can be written in the form a*i + b (i is the loop index; a, b are constants)
  – The index of a multi‐dimensional array is affine if the index in each dimension is affine
  – Non‐affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example

• Assume:
  – a store to an array element with index a*i + b, and a load from the same array with index c*i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m <= j <= n and m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop‐carried dependence exists, then GCD(c,a) divides |d-b|

Example

for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible
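A tiny C sketch of this screening test (an illustrative helper assuming positive coefficients a and c; real compilers combine it with loop-bounds information):

#include <stdlib.h>

static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* GCD test: store index a*i+b, load index c*i+d.
   Returns 0 only when a loop-carried dependence is impossible. */
int may_depend(int a, int b, int c, int d) {
    return abs(d - b) % gcd(a, c) == 0;  /* dependence requires GCD(c,a) | |d-b| */
}

/* may_depend(2, 3, 2, 0) == 0: 2 does not divide 3, so no dependence */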

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP‐complete

Finding Dependencies: Example 2

for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1 -> S3 (Y[i]), S1 -> S4 (Y[i]), but not loop‐carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}

• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}

Eliminating Recurrence Dependence

• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop‐carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];          /* scalar expansion: scalar ==> vector; parallel */
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];  /* a reduction: sums the vector's elements; not parallel */

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse‐grained vs. fine‐grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine‐grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real‐time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

Page 13: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Memory‐Memory vs Vector Register Machines

bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory

bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses

bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100

elementsndash For Cray‐1 vectorscalar breakeven point was around 2

elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector

machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)

CA-Lec8 cwliutwinseenctuedutw 13

Vector Instruction Set Advantagesbull Compact

ndash One short instruction encodes N operationsbull Expressive

ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous

instructionndash N operations access a contiguous block of memory (unit‐stride

loadstore)ndash N operations access memory in a known pattern (stridden loadstore)

bull Scalablendash Can run same object code on more parallel pipelines or lanes

CA-Lec8 cwliutwinseenctuedutw 14

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses contiguous data
  – No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128–256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
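The roofline itself is a one-line formula; the numbers below are illustrative values of mine, not measurements from the lecture:

    Attainable GFLOPs/sec = min(Peak memory BW × Arithmetic intensity, Peak floating-point performance)

For instance, a machine with 16 GB/sec of memory bandwidth and a 16 GFLOPs/sec FP peak has its ridge point at an arithmetic intensity of 1 FLOP/byte. DAXPY performs 2 FLOPs per element while moving 24 bytes (load X[i], load Y[i], store Y[i], 8 bytes each), an intensity of 1/12, so on that machine it is memory-bound at about 16 × 1/12 ≈ 1.3 GFLOPs/sec.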

Examples
• Attainable GFLOPs/sec (roofline plots for example machines; figure omitted in this text version)

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals
  – Extend a standard sequential programming language, specifically C/C++
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – A minimalist set of abstractions for expressing parallelism
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences
  – No scalar processor; scalar code runs on the host
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7 to 15 multithreaded SIMD processors
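The grid/block arithmetic on this slide can be written down directly; the kernel name and launch below are a hypothetical sketch of mine matching the slide's numbers:

    int n = 8192;
    int threadsPerBlock = 512;                    /* = 16 SIMD threads (warps) of 32 lanes */
    int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* = 8192/512 = 16 blocks */
    mulvec<<<nblocks, threadsPerBlock>>>(n, x, y);   /* mulvec is an assumed kernel name */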

Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2,048 registers
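Checking the arithmetic on this slide: 16 lanes × 2,048 registers = 32,768 registers per SIMD processor, and one SIMD thread using its full 64-register budget across its 32 lanes occupies 64 × 32 = 2,048 of them, so 32,768 / 2,048 = 16 such SIMD threads can keep their state resident at once.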

GTX570 GPU (organization summary; the original slide is a block diagram)
• Global memory: 1280MB; shared L2 cache: 640KB
• 15 SMs (SM 0 … SM 14), 32 cores each, up to 1536 threads per SM
• Per SM: 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively: memory access optimization, latency hiding
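Combining the figures above: 15 SMs × 1,536 threads/SM = 23,040 threads in flight on a GTX570, which is the active-thread count used in the matrix-multiplication sizing a few slides later.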

Matrix Multiplication (illustration figure omitted)

Programming the GPU (illustration figure omitted)

Matrix Multiplication
• For a 4096×4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
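A minimal CUDA sketch of the scheme the slide describes (one thread per C element, with sub-matrix tiles staged in shared memory) is given below; the kernel name and the 16×16 tile size are my assumptions, not values from the lecture, and 4096 is a multiple of the assumed tile size:

#define TILE 16   /* assumed tile size; requires n to be a multiple of TILE */

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   /* this thread owns C[row][col] */
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        /* stage one tile of A and one tile of B from global memory */
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               /* wait until the whole tile is loaded */

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* wait before overwriting the tiles */
    }
    C[row * n + col] = acc;            /* one result element per thread */
}

It would be launched with a 2D grid, e.g. dim3 block(TILE, TILE); dim3 grid(n/TILE, n/TILE); so that every C element gets exactly one thread.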

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU (device); variables declared in them are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: identifier of a thread within its block
  – blockDim: threads per block

CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;   // rounds up, so a final partial block covers the tail
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__   // kernel entry point; a function launched with <<<...>>> must be __global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one CUDA thread of DAXPY):

shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]          ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0          ; P1 is predicate register 1
@!P1, bra     ELSE1, *Push         ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]          ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2        ; Difference in RD0
st.global.f64 [X+R8], RD0          ; X[i] = RD0
@P1, bra      ENDIF1, *Comp        ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc. (block-diagram figure omitted)

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: the loop has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by the same statement in the previous iteration (A[i] by S1, B[i] by S2)
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact" analysis) and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example
• Assume
  – Load an array element with index c × i + d, and store to an element with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d (the store in iteration j and the load in iteration k touch the same element)
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)

Example
for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d − b| = 3
  3. Since 2 does not divide 3, no dependence is possible
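A compiler's version of this check fits in a few lines of C; this helper is a sketch of mine, not code from the lecture:

/* GCD test for store index a*i+b vs. load index c*i+d.
   Returns 0 when no loop-carried dependence is possible;
   1 means a dependence may exist (the test is sufficient, not necessary). */
int gcd(int x, int y) { while (y != 0) { int t = x % y; x = y; y = t; } return x; }

int gcd_test(int a, int b, int c, int d)
{
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return (diff % g) == 0;   /* e.g., a=2, b=3, c=2, d=0: 3 % 2 != 0, so it returns 0 */
}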

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (a dependence is not ruled out) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependencies: Example 2
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1 -> S3 and S1 -> S4 (both on Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;     /* Y renamed to T */
    X1[i] = X[i] + c;    /* X renamed to X1 */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence
• Recurrence example: a dot-product-style sum
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
  – This is called scalar expansion (scalar ==> vector); this loop is parallel
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
  – This is called a reduction; it sums up the elements of the vector and is not parallel

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
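Spelled out end to end (a sketch under the slide's assumptions: 10,000 elements, ten processors), the whole reduction looks like this, with only the final ten-element combine left serial:

double finalsum[10] = {0};
for (int p = 0; p < 10; p++)            /* each p can run on its own processor */
    for (int i = 999; i >= 0; i--)
        finalsum[p] += sum[i + 1000*p];

double total = 0.0;
for (int p = 0; p < 10; p++)            /* serial combine: only 10 additions */
    total += finalsum[p];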

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then the hardware is simpler, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth


ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 15: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Instructions Examplebull Example DAXPY

LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result

bull In MIPS Codebull ADD waits for MUL SD waits for ADD

bull In VMIPSbull Stall once for the first vector element subsequent elements will flow

smoothly down the pipeline bull Pipeline stall required once per vector instruction

CA-Lec8 cwliutwinseenctuedutw 15

adds a scalar multiple of a double precision vector to another double precision vector

Requires 6 instructions only

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

Vector Instruction Parallelism

• Can overlap execution of multiple vector instructions.
  – Example machine: 32 elements per vector register and 8 lanes.

[Figure: timeline in which the load unit, multiply unit, and add unit each work on a different vector instruction at the same time; each instruction occupies its unit for 32/8 = 4 cycles, so a new load, multiply, and add are all in flight together.]

• Complete 24 operations/cycle (3 units × 8 lanes) while issuing only 1 short instruction/cycle.

Convoy

• Convoy: set of vector instructions that could potentially execute together.
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining.

Vector Chaining

• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available.

    LV    V1
    MULV  V3,V1,V2
    ADDV  V5,V3,V4

[Figure: the load unit fills V1 from memory; the multiplier chains off V1 as each element arrives and writes V3; the adder chains off V3 in turn and writes V5.]

Vector Chaining Advantage

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction.
• With chaining, can start the dependent instruction as soon as the first result appears.

[Figure: timeline showing Load, Mul, and Add fully serialized without chaining, versus Load, Mul, and Add overlapped in time with chaining.]

Chimes

• Chime: unit of time to execute one convoy.
  – m convoys execute in m chimes.
  – For a vector length of n, m convoys require m × n clock cycles.

Example

    LV      V1,Rx     ; load vector X
    MULVS.D V2,V1,F0  ; vector-scalar multiply
    LV      V3,Ry     ; load vector Y
    ADDVV.D V4,V2,V3  ; add two vectors
    SV      Ry,V4     ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles.

Vector Strip-mining

Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers: "strip-mining".

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

          ANDI   R1, N, 63    ; N mod 64
          MTC1   VLR, R1      ; do remainder first
    loop: LV     V1, RA
          DSLL   R2, R1, 3    ; multiply by 8
          DADDU  RA, RA, R2   ; bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1     ; subtract elements done
          LI     R1, 64
          MTC1   VLR, R1      ; reset to full length
          BGTZ   N, loop      ; any more to do?

[Figure: arrays A, B, C processed as an odd-size remainder piece followed by full 64-element strips.]

Vector Length Register

• Handles loops whose length is not equal to 64.
• The vector length is often not known at compile time. Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:

low = 0;
VL = (n % MVL);                         /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {      /* outer loop */
    for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
        Y[i] = a * X[i] + Y[i];         /* main operation */
    low = low + VL;                     /* start of next vector */
    VL = MVL;                           /* reset the length to maximum vector length */
}

Maximum Vector Length (MVL)

• Determines the maximum number of elements in a vector for a given architecture.
• All blocks but the first are of length MVL.
  – Utilize the full power of the vector processor.
• Later generations may grow the MVL.
  – No need to change the ISA.

Vector-Mask Control

• Simple implementation
  – Execute all N operations; turn off result writeback according to the mask.
• Density-time implementation
  – Scan the mask vector and only execute elements with non-zero masks.

[Figure: for mask M[0..7] = 0,1,0,0,1,1,0,1, the simple implementation computes A[i] op B[i] in every lane but write-disables the masked-off positions at the write data port; the density-time implementation sends only elements 1, 4, 5, 7 down the pipeline to the write data port.]

Vector Mask Register (VMR)

• A Boolean vector to control the execution of a vector instruction.
• The VMR is part of the architectural state.
• A vector processor relies on compilers to manipulate the VMR explicitly.
• A GPU gets the same effect using hardware.
  – Invisible to SW
• Both GPUs and vector processors spend time on masking.

Vector Mask Registers: Example

• Programs that contain IF statements in loops cannot otherwise be run in a vector processor.
• Consider:

for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:

    LV      V1,Rx     ; load vector X into V1
    LV      V2,Ry     ; load vector Y
    L.D     F0,#0     ; load FP zero into F0
    SNEVS.D V1,F0     ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1,V1,V2  ; subtract under vector mask
    SV      Rx,V1     ; store the result in X

• The GFLOPS rate decreases, since masked-off element positions still consume execution time without producing useful results.
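A minimal C sketch of what the masked sequence computes, assuming 64-element vectors (VM stands in for the vector mask register):

    /* Sketch of masked-execution semantics for the code above. */
    void masked_sub(double *X, const double *Y) {
        int VM[64];                          /* vector mask register     */
        for (int i = 0; i < 64; i++)
            VM[i] = (X[i] != 0.0);           /* SNEVS.D: set mask bits   */
        for (int i = 0; i < 64; i++)
            if (VM[i])
                X[i] = X[i] - Y[i];          /* SUBVV.D under mask + SV  */
    }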

Compress/Expand Operations

• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register.
  – The population count of the mask vector gives the packed vector length.
• Expand performs the inverse operation.

[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] into the first four elements of the destination; expand scatters a packed vector back into positions 1, 4, 5, 7 of a destination B, leaving the masked-off positions of B unchanged.]

Used for density-time conditionals and also for general selection operations.
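A small C sketch of the two operations (illustrative helper names, not any machine's ISA):

    /* Compress: pack masked-on elements of src to the front of dst.
       Returns the packed length (the population count of the mask). */
    int compress(const double *src, const int *mask, double *dst, int n) {
        int len = 0;
        for (int i = 0; i < n; i++)
            if (mask[i]) dst[len++] = src[i];
        return len;
    }

    /* Expand: the inverse; scatter packed src back to the masked-on
       positions of dst, leaving masked-off positions unchanged. */
    void expand(const double *src, const int *mask, double *dst, int n) {
        int j = 0;
        for (int i = 0; i < n; i++)
            if (mask[i]) dst[i] = src[j++];
    }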

Memory Banks

• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector.
• Penalties for start-ups on load/store units are higher than those for arithmetic units.
• The memory system must be designed to support high bandwidth for vector loads and stores:
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently.
  2. Support (multiple) non-sequential loads or stores.
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses.

Example (Cray T90)

• 32 processors, each generating 4 loads and 2 stores per cycle.
• Processor cycle time is 2.167 ns.
• SRAM cycle time is 15 ns.
• How many memory banks are needed?

• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192.
2. An SRAM access takes 15 / 2.167 = 6.92 ≈ 7 processor cycles.
3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth!

The Cray T932 actually has only 1024 memory banks (it cannot sustain full bandwidth).

Memory Addressing

• Load/store operations move groups of data between registers and memory.
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize

Interleaved Memory Layout

[Figure: a vector processor fanning out to eight unpipelined DRAM banks; bank i holds the words whose address mod 8 = i.]

• Great for unit stride:
  – Contiguous elements are in different DRAMs.
  – Start-up time for a vector operation is the latency of a single read.
• What about non-unit stride?
  – The layout above is good for strides that are relatively prime to 8.
  – Bad for strides of 2, 4.

Handling Multidimensional Arrays in Vector Architectures

• Consider:

for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

• Must vectorize multiplications of rows of B with columns of D.
  – Need to access adjacent elements of B and adjacent elements of D.
  – Elements of B are accessed in row-major order, but elements of D in column-major order.
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements.

(Unit/Non-Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride.
• In the example:
  – Matrix D has a stride of 100 double words.
  – Matrix B has a stride of 1 double word.
• Use non-unit stride for D:
  – To access non-sequential memory locations and to reshape them into a dense structure.
• The size of the matrix may not be known at compile time:
  – Use LVWS/SVWS: load/store vector with stride.
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic).
• Caches inherently deal with unit-stride data:
  – Blocking techniques help non-unit-stride data.

How to get full bandwidth for unit stride?

• The memory system must sustain (# lanes × word) per clock.
• # memory banks > memory latency, to avoid stalls.
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types.
• More banks are also good for supporting more strides at full bandwidth.
  – (See the literature on how to build a prime number of banks efficiently.)

Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses.
• SW fix: loop interchange, or declaring the array not a power of 2 ("array padding"; see the sketch below).
• HW fix: a prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
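A sketch of the software fix, array padding (the pad size is illustrative; any row length that breaks the power-of-two stride works):

    /* Rows of 513 words instead of 512, so successive elements of a
       column map to different banks. The pad word is never referenced. */
    int x[256][513];

    void scale(void) {
        for (int j = 0; j < 512; j = j+1)
            for (int i = 0; i < 256; i = i+1)
                x[i][j] = 2 * x[i][j];
    }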

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently.
  – Memory bank conflict!
• Stall the other request to resolve a bank conflict.
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time

Example

• 8 memory banks. Bank busy time: 6 cycles. Total memory latency: 12 cycles for an initial access.
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?

• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element.
2. Since 32 is a multiple of 8, this is the worst possible stride:
   – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time.
   – The load will take 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element.
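A tiny C model reproducing the arithmetic above (a sketch: it covers only the no-conflict and every-access-conflict cases of this example; the general condition is the LCM test from the previous slide):

    #include <stdio.h>

    int main(void) {
        const int banks = 8, busy = 6, latency = 12, n = 64;
        int strides[2] = {1, 32};
        for (int s = 0; s < 2; s++) {
            /* When the stride is a multiple of the bank count, every
               access after the first revisits a still-busy bank. */
            int conflict = (strides[s] % banks == 0);
            int cycles = conflict ? latency + 1 + busy * (n - 1)
                                  : latency + n;
            printf("stride %2d: %d cycles (%.1f per element)\n",
                   strides[s], cycles, (double)cycles / n);
        }
        return 0;   /* prints 76 cycles and 391 cycles */
    }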

Handling Sparse Matrices in Vector Architectures

• Supporting sparse matrices in vector mode is a necessity.
• Example: consider a sparse vector sum on arrays A and C:

for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors that designate the nonzero elements of A and C.
• Sparse matrix elements are stored in a compact form and accessed indirectly.
• Gather-scatter operations are used:
  – LVI/SVI: load/store vector indexed (gathered).

Scatter-Gather

• Consider:

for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors.
• Use index vectors:

    LV      Vk, Rk        ; load K
    LVI     Va, (Ra+Vk)   ; load A[K[]] (gather)
    LV      Vm, Rm        ; load M
    LVI     Vc, (Rc+Vm)   ; load C[M[]] (gather)
    ADDVV.D Va, Va, Vc    ; add them
    SVI     (Ra+Vk), Va   ; store A[K[]] (scatter)

• A and C must have the same number of non-zero elements (the sizes of K and M).
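The same computation decomposed into explicit gather, vector add, and scatter phases in plain C (a sketch of what LVI/SVI do; va and vc stand in for vector registers):

    void sparse_sum(double *A, const double *C,
                    const int *K, const int *M, int n,
                    double *va, double *vc) {
        for (int i = 0; i < n; i++) va[i] = A[K[i]];   /* gather A[K[]]  */
        for (int i = 0; i < n; i++) vc[i] = C[M[i]];   /* gather C[M[]]  */
        for (int i = 0; i < n; i++) va[i] += vc[i];    /* vector add     */
        for (int i = 0; i < n; i++) A[K[i]] = va[i];   /* scatter result */
    }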

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP.
  – If code is vectorizable, then it offers simpler hardware, better energy efficiency, and a better real-time model than out-of-order execution.
• More lanes, slower clock rate!
  – Scalable if elements are independent.
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element.
• The programmer is in charge of giving hints to the compiler.
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations.
• The fundamental design issue is memory bandwidth.
  – Especially with virtual address translation and caching.

Programming Vector Architectures

• Compilers can provide feedback to programmers.
• Programmers can provide hints to the compiler.
• Cray Y-MP benchmarks [results table not reproduced].

Multimedia Extensions

• Very short vectors added to existing ISAs.
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b.
• Newer designs have 128-bit registers (Altivec, SSE2).
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors.

"Vector" for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
  – similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements in 64 bits
  – reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optional signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM

• VMX (1996-1998)
  – 32 4-bit, 16 8-bit, 8 16-bit, 4 32-bit integer ops, and 4 32-bit FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8-bit, 8 16-bit, 4 32-bit integer ops, and 4 32-bit and 8 64-bit FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64-bit FPUs, 4 32-bit FPUs
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64-bit SP or DP FP

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size.
• Costs little to add to the standard arithmetic unit.
• Easy to implement.
• Needs smaller memory bandwidth than a vector architecture.
• Separate data transfers, aligned in memory.
  – Vector: a single instruction makes 64 memory accesses; a page fault in the middle of the vector is likely!
• Uses much smaller register space.
• Fewer operands.
• No need for the sophisticated mechanisms of vector architectures.

Example SIMD Code

• Example: DXPY

          L.D     F0,a        ; load scalar a
          MOV     F1,F0       ; copy a into F1 for SIMD MUL
          MOV     F2,F0       ; copy a into F2 for SIMD MUL
          MOV     F3,F0       ; copy a into F3 for SIMD MUL
          DADDIU  R4,Rx,#512  ; last address to load
    Loop: L.4D    F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D  F4,F4,F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
          L.4D    F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D  F8,F8,F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
          S.4D    0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU  Rx,Rx,#32   ; increment index to X
          DADDIU  Ry,Ry,#32   ; increment index to Y
          DSUBU   R20,R4,Rx   ; compute bound
          BNEZ    R20,Loop    ; check if done

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only accesses contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write an entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model

• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity.
  – Ties together floating-point performance and memory performance for a target machine.
• Arithmetic intensity:
  – Floating-point operations per byte read

Examples

• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

[Figure: roofline plots; the sloped left segment is the memory-bandwidth bound, the flat right segment is the compute bound, and their intersection is the ridge point.]
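A small C sketch that evaluates this roofline formula (the peak numbers are made-up placeholders, not any particular machine's):

    #include <stdio.h>

    /* Attainable GFLOP/s = min(peak_bw * arithmetic intensity, peak FP rate). */
    double roofline(double peak_gflops, double peak_bw_gbs, double intensity)
    {
        double mem_bound = peak_bw_gbs * intensity;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void)
    {
        /* Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s memory BW. */
        for (double ai = 0.25; ai <= 16.0; ai *= 2.0)
            printf("AI %5.2f FLOP/byte -> %6.2f GFLOP/s\n",
                   ai, roofline(100.0, 25.0, ai));
        return 0;
    }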

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • The CPU is the host; the GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – The programming model is "Single Instruction, Multiple Thread".

Programming Model

• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++.
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language.
  – A minimalist set of abstractions for expressing parallelism.
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor or scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU

• CUDA programming model:
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element.
    • Threads are organized into blocks.
    • Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS.
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example

• Multiply two vectors of length 8192.
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes.
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time.
  – Thus, the grid size = 16 blocks.
  – A block is analogous to a strip-mined vector loop with a vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.

Terminology

• Threads of SIMD instructions:
  – Each has its own PC.
  – The thread scheduler uses a scoreboard to dispatch.
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions.
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example

• An NVIDIA GPU has 32,768 registers.
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers.

GTX570 GPU

[Diagram: memory hierarchy of the GTX570. Global memory: 1280MB; shared L2 cache: 640KB. 15 streaming multiprocessors (SM 0 ... SM 14), each with 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache; up to 1536 threads per SM.]

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively.
  – Memory access optimization, latency hiding

Matrix Multiplication

[Slide figure: GPU matrix-multiplication thread layout, not reproduced.]

Programming the GPU

[Slide figure: CUDA code listing, not reproduced.]

Matrix Multiplication

• For a 4096x4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
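A sketch of the tiled kernel this slide describes (TILE, the kernel name, and the square, TILE-divisible matrix size are illustrative assumptions, not the lecture's exact code):

    #define TILE 16

    __global__ void matmul(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < n / TILE; t++) {          /* assumes n % TILE == 0 */
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                          /* tile fully loaded      */
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                          /* done with this tile    */
        }
        C[row * n + col] = sum;                       /* one thread, one cell   */
    }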

Programming the GPU

• Distinguishing where functions execute:
  – __device__ or __global__ => GPU (device); variables declared in them are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier within the grid
  – threadIdx: thread identifier within its block
  – blockDim: number of threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
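A hypothetical host-side wrapper for the launch above, using standard CUDA runtime calls (error handling omitted; buffer names are illustrative):

    #include <cuda_runtime.h>

    void run_daxpy(int n, double a, const double *hx, double *hy)
    {
        double *dx, *dy;
        size_t bytes = n * sizeof(double);
        cudaMalloc((void **)&dx, bytes);                    /* device buffers        */
        cudaMalloc((void **)&dy, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
        int nblocks = (n + 255) / 256;                      /* 256 threads per block */
        daxpy<<<nblocks, 256>>>(n, a, dx, dy);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);  /* result back to host   */
        cudaFree(dx);
        cudaFree(dy);
    }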

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example (one DAXPY iteration):

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    setp.neq.s32  P1, RD0, #0        ; P1 is predicate register 1
    @!P1, bra     ELSE1, *Push       ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    sub.f64       RD0, RD0, RD2      ; Difference in RD0
    st.global.f64 [X+R8], RD0        ; X[i] = RD0
    @P1, bra      ENDIF1, *Comp      ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64 RD0, [Z+R8]        ; RD0 = Z[i]
    st.global.f64 [X+R8], RD0        ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop         ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM.
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory.
  – Shared by the SIMD lanes / threads within a block
• Memory shared by the SIMD processors is GPU memory.
  – The host can read and write GPU memory.

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error-correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Processor

[Slide figure: block diagram of a Fermi multithreaded SIMD processor, not reproduced.]

Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations.
• Loop-level parallelism means there is no loop-carried dependence.
• Example 1:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

• No loop-carried dependence.

Example 2 for Loop-Level Parallelism

• Example 2:

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed in the previous iteration (A[i] by S1, B[i] by S2).
  – Loop-carried dependence.
• S2 uses the value A[i+1] computed by S1 in the same iteration.
  – Not a loop-carried dependence.

Remarks

• Intra-loop dependence is not loop-carried dependence.
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence.
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1.
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1.
• A loop is parallel if it can be written without a cycle in the dependences.
  – The absence of a cycle means that the dependences give a partial ordering on the statements.

Example 3 for Loop-Level Parallelism (1)

• Example 3:

for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.

Example 3 for Loop-Level Parallelism (2)

• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried.

More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction.
  – The two references are the same, and there is no intervening memory access to the same location.
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize.

Recurrence

• A recurrence is a special form of loop-carried dependence.
• Recurrence example:

for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences.
  – It may still be possible to exploit a fair amount of ILP.

Compiler Technology for Finding Dependences

• Used to determine which loops might contain parallelism ("inexact") and to eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index).
  – The index of a multi-dimensional array is affine if the index in each dimension is affine.
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.

Finding Dependences: Example

• Assume:
  – A load of an array element with index c × i + d, and a store to index a × i + b.
  – i runs from m to n.
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k within the loop bounds: m ≤ j ≤ n, m ≤ k ≤ n.
  2. a × j + b = c × k + d.
• In general, the values of a, b, c, and d are not known at compile time.
  – Dependence testing is expensive but decidable.
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d - b).

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
1. a=2, b=3, c=2, and d=0.
2. GCD(a,c)=2, |b-d|=3.
3. Since 2 does not divide 3, no dependence is possible.
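A compact C sketch of the GCD test applied to this example (illustrative only; a real compiler would also consider the loop bounds):

    #include <stdio.h>

    int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    /* Store to a*i + b, load from c*i + d: a loop-carried dependence
       requires GCD(c, a) to divide (d - b). */
    int may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void)
    {
        /* X[2i+3] = X[2i] * 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("dependence possible? %s\n",
               may_depend(2, 3, 2, 0) ? "yes" : "no");   /* prints "no" */
        return 0;
    }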

Remarks

• The GCD test is sufficient to rule out a dependence, but not necessary for one to be absent.
  – The GCD test does not consider the loop bounds.
  – There are cases where the GCD test succeeds but no dependence exists.
• Determining whether a dependence actually exists is NP-complete.

Finding Dependences: Example 2

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried.
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i]).
• Output dependence: S1->S4 (Y[i]).

Renaming to Eliminate False (Pseudo) Dependences

• Before:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:

for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence

• Recurrence example, a dot product:

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence on sum. Transform to:

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

• The first loop uses "scalar expansion" (scalar ==> vector) and is vectorizable/parallel.
• The second loop is a "reduction": it sums up the elements of the vector and is not parallel.

Reduction

• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures.
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum up 1000 elements on each of ten processors:

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9.

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse-grained vs. fine-grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture.
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP.
  – If code is vectorizable, then it offers simpler hardware, better energy efficiency, and a better real-time model than OOO machines.
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.

Page 16: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Challenges of Vector Instructionsbull Start up time

ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP

ndash Latency of vector functional unitndash Assume the same as Cray‐1

bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

16

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU (block diagram, flattened to text)
• Global Memory: 1280MB; L2 Cache: 640KB
• 15 streaming multiprocessors (SM 0 ... SM 14), each with:
  – 32 cores
  – Registers: 32768
  – Shared Memory: 48KB; L1 Cache: 16KB
  – Texture Cache: 8KB; Constant Cache: 8KB
• Up to 1536 Threads/SM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the sketch after this slide).

CA-Lec8 cwliutwinseenctuedutw 72
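A sketch of that shared-memory buffering, assuming a hypothetical 16x16 tile, float data, and a matrix dimension divisible by the tile size (names are illustrative, not the course's code):

    #define TILE 16   // hypothetical tile width: 16x16 = 256 threads per block

    __global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
        __shared__ float As[TILE][TILE];  // sub-matrix of A buffered in shared memory
        __shared__ float Bs[TILE][TILE];  // sub-matrix of B buffered in shared memory

        int row = blockIdx.y * TILE + threadIdx.y;  // this thread's cell of C
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < N / TILE; t++) {        // march tiles along the k dimension
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                        // whole tile loaded before use
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                        // done with tile before overwriting
        }
        C[row * N + col] = sum;                     // one thread writes one cell of C
    }

    // Launch for N = 4096: dim3 grid(N/TILE, N/TILE), block(TILE, TILE);
    // matMulTiled<<<grid, block>>>(A, B, C, N);

Each tile element is fetched from global memory once per block rather than once per thread, which is the bandwidth advantage the last bullet describes.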

Programming the GPU
• Distinguishing execution place of functions:
  – _device_ or _global_ => GPU Device
    • Variables declared are allocated to the GPU memory
  – _host_ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: threads per block identifier
  – blockDim: threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (note: a kernel launched with <<<...>>> must be declared
// __global__ in current CUDA; the slide writes __device__)
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
  shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512, i.e. 2^9)
  add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
  ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
  ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
  mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]     ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0     ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push               ; Push old mask, set new mask bits
                                       ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]     ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2   ; Difference in RD0
        st.global.f64  [X+R8], RD0     ; X[i] = RD0
  @P1,  bra ENDIF1, *Comp              ; complement mask bits
                                       ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]     ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0     ; X[i] = RD0
ENDIF1: <next instruction>, *Pop       ; pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures
• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations
• Each SIMD processor has
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;
• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop-Level Parallelism
• Example 2:
  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 each use a value they themselves computed in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration (A[i+1])
  – No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – Circular: S1 depends on S2, and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop-Level Parallelism (1)
• Example 3:
  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop-Level Parallelism (2)
• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop-Level Parallelism
  for (i=0; i<100; i=i+1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
  }
• The second reference to A need not be translated to a load instruction
  – The two references are the same: there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize (a sketch follows)

CA-Lec8 cwliutwinseenctuedutw 86
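A sketch of the register reuse this analysis enables (illustrative C; the compiler performs this, the source need not be rewritten):

    for (int i = 0; i < 100; i++) {
        double a = B[i] + C[i];  // computed once, kept in a register
        A[i] = a;                // one store to A[i]
        D[i] = a * E[i];         // reuse the register: no reload of A[i]
    }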

Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding Dependencies: Example
• Assume:
  – Load from an array element with index c × i + d, and store to an element with index a × i + b
  – i runs from m to n
• Dependence exists if the following two conditions hold:
  1. There are j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

CA-Lec8 cwliutwinseenctuedutw 89

Example
  for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |d-b|=3
  3. Since 2 does not divide 3, no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90
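The test is easy to mechanize; a small C sketch (function names gcd and gcd_test are hypothetical; assumes at least one of the strides a, c is nonzero):

    #include <stdio.h>
    #include <stdlib.h>

    /* Greatest common divisor via Euclid's algorithm. */
    static int gcd(int a, int b) { return b == 0 ? abs(a) : gcd(b, a % b); }

    /* GCD test: a loop-carried dependence between a store to X[a*i+b]
       and a load of X[c*i+d] is possible only if GCD(c,a) divides (d-b). */
    static int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;   /* 1 = possible, 0 = ruled out */
    }

    int main(void) {
        /* The slide's example: X[2i+3] = X[2i] * 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("%s\n", gcd_test(2, 3, 2, 0) ? "dependence possible"
                                            : "no dependence possible");
        return 0;
    }

For the example above this prints "no dependence possible", matching the hand calculation.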

Remarks
• The GCD test is sufficient but not necessary
  – If the test fails (the GCD does not divide d-b), no dependence is possible
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding Dependencies
• Example 2:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;   /* S1 */
      X[i] = X[i] + c;   /* S2 */
      Z[i] = Y[i] + c;   /* S3 */
      Y[i] = c - Y[i];   /* S4 */
  }
• True dependences: S1->S3 and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence
• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];

CA-Lec8 cwliutwinseenctuedutw 94

The first transformed loop is called scalar expansion (scalar ===> vector); it is parallel.
The second is called a reduction: it sums up the elements of the vector, and it is not parallel.

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: To sum up 1000 elements on each of ten processors (see the sketch below)
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95
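A serial C sketch of this two-level scheme (the name reduce_10k is hypothetical; in practice the outer p loop is what runs in parallel, one iteration per processor):

    /* Stage 1: ten private partial sums, one per "processor" p. */
    double reduce_10k(const double sum[10000]) {
        double finalsum[10] = {0.0};
        for (int p = 0; p < 10; p++)             /* parallel across processors */
            for (int i = 999; i >= 0; i--)
                finalsum[p] += sum[i + 1000*p];  /* private partial sum */

        /* Stage 2: short serial combine of the ten partials. */
        double total = 0.0;
        for (int p = 0; p < 10; p++)
            total += finalsum[p];
        return total;
    }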

Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 17: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

V1 V2 V3

V3 lt- v1 v2

Six stage multiply pipeline

Vector Arithmetic Execution

bull Use deep pipeline (=gt fast clock) to execute element operations

bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)

CA-Lec8 cwliutwinseenctuedutw 17

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 18: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Instruction Execution

CA-Lec8 cwliutwinseenctuedutw 18

ADDV CAB

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

Execution using one pipelined functional unit

C[4]

C[8]

C[0]

A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]

C[5]

C[9]

C[1]

A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]

C[6]

C[10]

C[2]

A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]

C[7]

C[11]

C[3]

A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]

Four-laneexecution using four pipelined functional units

Vector Unit Structure

Lane

Functional Unit

VectorRegisters

Memory Subsystem

Elements 0 4 8 hellip

Elements 1 5 9 hellip

Elements 2 6 10 hellip

Elements 3 7 11 hellip

CA-Lec8 cwliutwinseenctuedutw 19

Multiple Lanes Architecturebull Beyond one element per clock cycle

bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes

ndash No communication between lanes

ndash Little increase in control overhead

ndash No need to change machine code

CA-Lec8 cwliutwinseenctuedutw 20

Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– Unit stride
• Contiguous block of information in memory
• Fastest; always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize the memory system for all possible strides
• A prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases the number of programs that vectorize
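In C terms, the three access patterns look like the loops below (a sketch; the arrays and the stride parameter are assumptions for illustration):

/* unit stride: contiguous, fastest */
double sum_unit(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}

/* non-unit (constant) stride, e.g. walking a column of a matrix */
double sum_strided(const double *a, int n, int stride) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i * stride];
    return s;
}

/* indexed (gather): vector equivalent of register indirect */
double sum_indexed(const double *a, const int *idx, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[idx[i]];
    return s;
}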

Interleaved Memory Layout
• Great for unit stride:
– Contiguous elements are in different DRAMs
– Start-up time for a vector operation is the latency of a single read
• What about non-unit stride?
– The layout above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4, ...
[Figure: a vector processor connected to eight unpipelined DRAM banks; bank i holds the words whose address mod 8 = i.]

Handling Multidimensional Arrays in Vector Architectures
• Consider:

for (i = 0; i < 100; i = i + 1)
  for (j = 0; j < 100; j = j + 1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k = k + 1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }

• Must vectorize the multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– Blocking techniques help with non-unit-stride data

How to Get Full Bandwidth for Unit Stride?
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– (there is published work on how to implement a prime number of banks efficiently)

Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j + 1)
  for (i = 0; i < 256; i = i + 1)
    x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk above conflicts on word accesses
• SW fix: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW fix: a prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– but modulo & divide per memory access are costly with a prime number of banks
– address within bank = address mod number of words in bank
– bank number is easy if there are 2^N words per bank
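The two software fixes look like this (a sketch under the slide's assumptions of word-interleaved banks whose count divides 512):

/* Fix 1: loop interchange - the inner loop now walks a row
   (unit stride), touching consecutive banks */
for (i = 0; i < 256; i = i + 1)
    for (j = 0; j < 512; j = j + 1)
        x[i][j] = 2 * x[i][j];

/* Fix 2: array padding - a row length of 513 is not a multiple of
   128, so successive elements of a column land in different banks */
int x[256][513];   /* last column unused; it only breaks the pattern */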

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time

Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
– every access to memory (after the first one) collides with the previous access and must wait out the 6-cycle bank busy time
– the total time is 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

for (i = 0; i < n; i = i + 1)
  A[K[i]] = A[K[i]] + C[M[i]];

– where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed (gathered)

Scatter-Gather
• Consider:

for (i = 0; i < n; i = i + 1)
  A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

LV       Vk, Rk        ; load K
LVI      Va, (Ra+Vk)   ; load A[K[]]  (gather)
LV       Vm, Rm        ; load M
LVI      Vc, (Rc+Vm)   ; load C[M[]]  (gather)
ADDVV.D  Va, Va, Vc    ; add them
SVI      (Ra+Vk), Va   ; store A[K[]] (scatter)

• A and C must have the same number of non-zero elements (the sizes of K and M)

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than out-of-order execution
• More lanes allow a slower clock rate
– Scalable if elements are independent
– If there is a dependency:
• One stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• [Table: Cray Y-MP benchmark vectorization results.]

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2×32b, 4×16b, or 8×8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
– similar to Intel 860, Motorola 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
– reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store on 8 8-bit operands
• Claimed overall speedup: 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
– used in drivers or added to library routines; no compiler support

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optionally signed/unsigned saturate (clamp to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (clamps to max) if the number is too large

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• They cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than a full vector machine
• Separate data transfers, aligned in memory
– Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

Example SIMD Code
• Example: DAXPY

L.D      F0,a          ; load scalar a
MOV      F1,F0         ; copy a into F1 for SIMD MUL
MOV      F2,F0         ; copy a into F2 for SIMD MUL
MOV      F3,F0         ; copy a into F3 for SIMD MUL
DADDIU   R4,Rx,#512    ; last address to load
Loop:
L.4D     F4,0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D   F4,F4,F0      ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
L.4D     F8,0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D   F8,F8,F4      ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
S.4D     0[Ry],F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU   Rx,Rx,#32     ; increment index to X
DADDIU   Ry,Ry,#32     ; increment index to Y
DSUBU    R20,R4,Rx     ; compute bound
BNEZ     R20,Loop      ; check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only accesses contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write an entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
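The roofline itself is the minimum of the two hardware limits (the standard form of the model):

Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

For instance, with 16 GB/sec of memory bandwidth, a kernel with an arithmetic intensity of 0.5 FLOPs/byte can attain at most 8 GFLOPs/sec, no matter how high the machine's peak floating-point rate is.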

Examples
• Attainable GFLOPs/sec
[Figure: roofline plots of attainable GFLOPs/sec versus arithmetic intensity for sample machines.]

History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction, Multiple Thread"

Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA programming model
– Single Instruction, Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
– The code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– A SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– A block is analogous to a strip-mined vector loop with a vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
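As CUDA code, the sizing works out as follows (a sketch; vecmul is a hypothetical element-wise kernel, not code from the lecture):

__global__ void vecmul(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
    if (i < n) z[i] = x[i] * y[i];
}

/* 8192 elements / 512 threads per block = a grid of 16 thread blocks: */
/*   vecmul<<<16, 512>>>(x, y, z, 8192);                               */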

Terminology
• Threads of SIMD instructions
– Each has its own PC
– The thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors

Example
• An NVIDIA GPU has 32,768 registers
– Divided among lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in an SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding

Matrix Multiplication
[Figure: matrix-multiplication example.]

Programming the GPU
[Figure: code for the matrix-multiplication example.]

Matrix Multiplication
• For a 4096×4096 matrix multiplication, Matrix C requires calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
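A minimal sketch of such a kernel, with each thread computing one cell of C and tiles of A and B buffered in shared memory (assumptions: square N×N matrices, N divisible by TILE; illustrative, not the lecture's code):

#define TILE 16

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   /* sub-matrix of A in shared memory */
    __shared__ float Bs[TILE][TILE];   /* sub-matrix of B in shared memory */

    int row = blockIdx.y * TILE + threadIdx.y;   /* this thread's cell of C */
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {         /* walk tiles along k */
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                         /* tile fully loaded before use */
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         /* finish with tile before reload */
    }
    C[row * N + col] = sum;                      /* one thread writes one cell */
}

Launched as matmul<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N); for N = 4096 that is 65,536 blocks of 256 threads, one per cell of C.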

Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU (device); variables declared in them are allocated to GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within the block
– blockDim: number of threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (a kernel launched from the host must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64   RD0, [X+R8]      ; RD0 = X[i]
setp.neq.s32    P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64   RD2, [Y+R8]      ; RD2 = Y[i]
sub.f64         RD0, RD0, RD2    ; Difference in RD0
st.global.f64   [X+R8], RD0      ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU memory
– The host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor.]

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism means there is no loop-carried dependence
• Example 1:

for (i = 999; i >= 0; i = i - 1)
  x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:

for (i = 0; i < 100; i = i + 1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration (A[i] and B[i])
– Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:

for (i = 0; i < 100; i = i + 1) {
  A[i] = A[i] + B[i];      /* S1 */
  B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:

A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

for (i = 0; i < 100; i = i + 1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
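The effect of that analysis is as if the source had been written with a temporary (t is a hypothetical compiler-generated name):

for (i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];  /* value stays in a register */
    A[i] = t;                /* one store, no reload of A[i] */
    D[i] = t * E[i];
}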

Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i = 1; i < 100; i = i + 1)
  Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependences: Example
• Assume:
– an array element is loaded with index c × i + d and stored with index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide |d - b|

Example

for (i = 0; i < 100; i = i + 1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution:
1. a = 2, b = 3, c = 2, and d = 0
2. GCD(a, c) = 2, |b - d| = 3
3. Since 2 does not divide 3, no dependence is possible
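The test is mechanical; a small C sketch (returns 1 when a dependence is *possible*, matching the sufficient-but-not-necessary caveat on the next slide):

#include <stdlib.h>

static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* store index a*i + b, load index c*i + d */
int gcd_test(int a, int b, int c, int d) {
    return abs(d - b) % gcd(abs(c), abs(a)) == 0;
}

/* For X[2i+3] = X[2i] * 5.0: gcd_test(2, 3, 2, 0) -> 3 % 2 != 0,
   so it returns 0: no dependence is possible */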

Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependences
• Example 2:

for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1 -> S3 (Y[i]), S1 -> S4 (Y[i]); not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:

for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}

• After:

for (i = 0; i < 100; i = i + 1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}

Eliminating Recurrence Dependence
• Recurrence example: a dot product

for (i = 9999; i >= 0; i = i - 1)
  sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

for (i = 9999; i >= 0; i = i - 1)
  sum[i] = x[i] * y[i];           /* "scalar expansion": scalar ==> vector; parallel */

for (i = 9999; i >= 0; i = i - 1)
  finalsum = finalsum + sum[i];   /* a "reduction": sums the vector's elements; not parallel */

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors:

for (i = 999; i >= 0; i = i - 1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];

– Assume p ranges from 0 to 9

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth


Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop-Level Parallelism
    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis within a basic block, can be applied to optimize such code

Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:
    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependences: Example
• Assume:
– An array element is loaded with index c × i + d and stored with index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)
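As an illustrative aside (not from the original slides), the GCD test is easy to code directly; the helper names here are mine:

    /* Greatest common divisor (Euclid's algorithm). */
    static int gcd(int x, int y) {
        if (x < 0) x = -x;
        if (y < 0) y = -y;
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    /* GCD test for a store to a*i + b and a load from c*i + d:
       a loop-carried dependence is possible only if GCD(c, a)
       divides (d - b). Assumes a and c are not both zero. */
    static int gcd_test_may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;
    }

Running gcd_test_may_depend(2, 3, 2, 0) for the example on the next slide returns 0, ruling out any dependence.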

Example
    for (i=0; i<100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a = 2, b = 3, c = 2, and d = 0
2. GCD(a, c) = 2, |d − b| = 3
3. Since 2 does not divide 3, no dependence is possible

Remarks
• The GCD test is sufficient to ensure there is no dependence, but it is not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test says a dependence is possible, but none actually exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependences: Example 2
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }
• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but they are not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After (T and X1 are fresh names):
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

Eliminating Recurrence Dependence
• Recurrence example: a dot product
    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to:
    for (i=9999; i>=0; i=i-1)   /* scalar expansion: scalar ==> vector; parallel */
        sum[i] = x[i] * y[i];
    for (i=9999; i>=0; i=i-1)   /* reduction: sums up the elements of the vector; not parallel */
        finalsum = finalsum + sum[i];

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (p ranges from 0 to 9):
    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];
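The same two-phase structure in plain C (an illustrative sketch; the decomposition constants and function names are mine):

    #define N 10000
    #define P 10          /* number of processors (illustrative) */

    double x[N], y[N], sum[N], finalsum[P];

    /* Phase 1: scalar expansion -- every element is independent,
       so this loop is fully parallel / vectorizable. */
    void expand_phase(void) {
        for (int i = 0; i < N; i++)
            sum[i] = x[i] * y[i];
    }

    /* Phase 2: each processor p reduces its own 1000-element slice;
       a short serial pass then combines the P partial sums. */
    double reduce_phase(void) {
        for (int p = 0; p < P; p++) {
            finalsum[p] = 0.0;
            for (int i = 0; i < N / P; i++)
                finalsum[p] += sum[i + (N / P) * p];
        }
        double total = 0.0;
        for (int p = 0; p < P; p++)
            total += finalsum[p];
        return total;
    }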

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth


Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
– Allows for multiple hardware lanes
– No communication between lanes
– Little increase in control overhead
– No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance

Code Vectorization
    for (i=0; i < N; i++)
        C[i] = A[i] + B[i];
[Figure: in scalar sequential code, each iteration issues its own load, load, add, store; in vectorized code, one vector load/load/add/store covers all iterations]
• Vectorization is a massive compile-time reordering of operation sequencing
– Requires extensive loop dependence analysis

Vector Execution Time
• Execution time depends on three factors:
– Length of operand vectors
– Structural hazards
– Data dependencies
• VMIPS functional units consume one element per clock cycle
– Execution time is approximately the vector length
• Convoy
– Set of vector instructions that could potentially execute together

Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– Example machine has 32 elements per vector register and 8 lanes
[Figure: load, multiply, and add units each work down a stream of vector instructions, 8 elements per cycle]
• Complete 24 operations/cycle while issuing 1 short instruction/cycle

Convoy
• Convoy: set of vector instructions that could potentially execute together
– Must not contain structural hazards
– Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining

Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available
    LV   V1
    MULV V3, V1, V2
    ADDV V5, V3, V4
[Figure: the load unit chains V1 into the multiplier, and the multiplier chains V3 into the adder, so all three units run overlapped]

Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: timelines of Load/Mul followed by Add, without and with chaining]

Chimes
• Chime: unit of time to execute one convoy
– m convoys execute in m chimes
– For a vector length of n, this requires m × n clock cycles

Example
    LV      V1, Rx      ; load vector X
    MULVS.D V2, V1, F0  ; vector-scalar multiply
    LV      V3, Ry      ; load vector Y
    ADDVV.D V4, V2, V3  ; add two vectors
    SV      Ry, V4      ; store the sum
Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV
3 chimes; 2 FP ops per result; cycles per FLOP = 1.5
For 64-element vectors, this requires 64 × 3 = 192 clock cycles

Vector Strip-mining
• Problem: vector registers have finite length
• Solution: break loops into pieces that fit into vector registers ("strip-mining")
    for (i=0; i<N; i++)
        C[i] = A[i] + B[i];
          ANDI   R1, N, 63    ; N mod 64
          MTC1   VLR, R1      ; do remainder piece first
    loop: LV     V1, RA
          DSLL   R2, R1, 3    ; multiply by 8
          DADDU  RA, RA, R2   ; bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1     ; subtract elements done
          LI     R1, 64
          MTC1   VLR, R1      ; reset to full length
          BGTZ   N, loop      ; any more to do?
[Figure: arrays A, B, C processed as a short remainder piece followed by full 64-element pieces]

Vector Length Register
• Handles loops whose length is not equal to the vector register length (e.g., 64)
• Vector length not known at compile time? Use the Vector Length Register (VLR) and strip mining for vectors over the maximum length:
    low = 0;
    VL = (n % MVL);  /* find odd-size piece using modulo op */
    for (j = 0; j <= (n/MVL); j=j+1) {      /* outer loop */
        for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
            Y[i] = a * X[i] + Y[i];         /* main operation */
        low = low + VL;  /* start of next vector */
        VL = MVL;        /* reset the length to maximum vector length */
    }

Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA

Vector-Mask Control
• Simple implementation
– Execute all N operations; turn off result writeback according to the mask
• Density-time implementation
– Scan the mask vector and only execute elements with non-zero masks
[Figure: both implementations writing A[i] op B[i] results under mask bits M[0..7] = 0,1,0,0,1,1,0,1; the simple scheme uses a write disable, the density-time scheme skips masked-off elements]

Vector Mask Register (VMR)
• A Boolean vector used to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
– Invisible to SW
• Both GPUs and vector processors spend time on masking

Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in vector mode without special support
• Consider:
    for (i = 0; i < 64; i=i+1)
        if (X[i] != 0)
            X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
    LV       V1, Rx      ; load vector X into V1
    LV       V2, Ry      ; load vector Y
    L.D      F0, #0      ; load FP zero into F0
    SNEVS.D  V1, F0      ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D  V1, V1, V2  ; subtract under vector mask
    SV       Rx, V1      ; store the result in X
• The GFLOPS rate decreases, since masked-off element slots still take execution time

Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
– The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: under mask 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters a packed vector back to those positions, leaving masked-off slots with their old B values]
• Used for density-time conditionals and also for general selection operations
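The semantics of the two operations, written as a scalar C sketch (mine, not the lecture's):

    /* Compress: pack elements of a[] whose mask bit is set into dst[];
       returns the packed length (the population count of the mask). */
    int compress(const double *a, const int *mask, int n, double *dst) {
        int len = 0;
        for (int i = 0; i < n; i++)
            if (mask[i])
                dst[len++] = a[i];
        return len;
    }

    /* Expand: inverse operation -- consume packed src[] values in order,
       writing them only to the positions where the mask bit is set. */
    void expand(const double *src, const int *mask, int n, double *dst) {
        int k = 0;
        for (int i = 0; i < n; i++)
            if (mask[i])
                dst[i] = src[k++];  /* unmasked slots keep old dst values */
    }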

Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. What about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
– Spread accesses across multiple banks:
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. Each SRAM access takes 15 / 2.167 ≈ 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth
• The Cray T932 actually has only 1024 memory banks (it cannot sustain full bandwidth)

Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– Unit stride
• Contiguous block of information in memory
• Fastest; always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize the memory system for all possible strides
• A prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases the number of programs that vectorize

Interleaved Memory Layout
[Figure: a vector processor fed by eight unpipelined DRAM banks; bank i holds the words with Addr mod 8 = i]
• Great for unit stride:
– Contiguous elements are in different DRAMs
– Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
– The layout above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4
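As a small aside (not in the slides), the low-order interleaving above maps word addresses to banks like this; the names are illustrative:

    /* Address-to-bank mapping for 8-way low-order interleaving
       (word addresses, power-of-two bank count). */
    #define NBANKS 8

    static inline int bank_of(unsigned addr)   { return addr % NBANKS; }  /* low bits */
    static inline int offset_in(unsigned addr) { return addr / NBANKS; }  /* high bits */

    /* A stride-2 sweep touches only banks 0,2,4,6 -- half the bandwidth;
       a stride-3 sweep (relatively prime to 8) touches all 8 banks. */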

Handling Multidimensional Arrays in Vector Architectures
• Consider:
    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }
• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– Blocking techniques help with non-unit-stride data

How to Get Full Bandwidth for Unit Stride
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– See the literature on how to support a prime number of banks efficiently

Memory Bank Conflicts
    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the column sweep conflicts on word accesses
• SW: loop interchange, or declaring the array size not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo and divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number is easy if there are 2^N words per bank

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflicts!
• A conflicting request is stalled until the bank is free
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time

Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the first access
• What is the difference between a 64-element vector load with a stride of 1 versus 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
– Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– The load will take 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
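A tiny C model of the slide's arithmetic (my sketch; it only captures the two extreme cases above, not intermediate strides such as 2):

    /* Rough cycle count for an n-element strided vector load:
       full overlap when consecutive accesses land in different
       banks, serialization at the bank busy time when every
       access reuses one bank. Illustrative only. */
    int load_cycles(int n, int stride, int banks, int busy, int latency) {
        if (stride % banks != 0)
            return latency + n;               /* e.g. stride 1: 12 + 64 = 76   */
        return latency + 1 + busy * (n - 1);  /* e.g. stride 32: 12 + 1 + 378  */
    }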

Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed (gather/scatter)

Scatter-Gather
• Consider:
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
    LV      Vk, Rk       ; load K
    LVI     Va, (Ra+Vk)  ; load A[K[]]
    LV      Vm, Rm       ; load M
    LVI     Vc, (Rc+Vm)  ; load C[M[]]
    ADDVV.D Va, Va, Vc   ; add them
    SVI     (Ra+Vk), Va  ; store A[K[]]
• A and C must have the same number of non-zero elements (the sizes of K and M)
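What the indexed load (LVI) and indexed store (SVI) do, spelled out as scalar C (a sketch of the semantics, not VMIPS code):

    /* Gather: va[i] = A[k[i]] -- indexed load (LVI) */
    void gather(const double *A, const int *k, double *va, int n) {
        for (int i = 0; i < n; i++)
            va[i] = A[k[i]];
    }

    /* Scatter: A[k[i]] = va[i] -- indexed store (SVI) */
    void scatter(double *A, const int *k, const double *va, int n) {
        for (int i = 0; i < n; i++)
            A[k[i]] = va[i];
    }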

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is a dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
– similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
– reuse the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
– used in drivers or added to library routines; no compiler support

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (sets to max) if the number is too large

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
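To make the SSE row concrete, a small C example using SSE intrinsics (mine, not from the slides) that adds two float arrays four lanes at a time:

    #include <xmmintrin.h>  /* SSE: 128-bit registers, 4 x 32-bit floats */

    /* C[i] = A[i] + B[i], four elements per instruction.
       Assumes n is a multiple of 4; unaligned loads/stores for simplicity. */
    void add4(const float *A, const float *B, float *C, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(&A[i]);          /* load 4 floats */
            __m128 b = _mm_loadu_ps(&B[i]);
            _mm_storeu_ps(&C[i], _mm_add_ps(a, b));  /* 4-wide add, store */
        }
    }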

SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPUs, 4 32b FPUs
– Integrates the FPU and VMX into one unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than a full vector architecture
• Data transfers are separate and aligned in memory
– Vector: a single instruction may make 64 memory accesses, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

Example SIMD Code
• Example: DAXPY
            L.D    F0, a         ; load scalar a
            MOV    F1, F0        ; copy a into F1 for SIMD MUL
            MOV    F2, F0        ; copy a into F2 for SIMD MUL
            MOV    F3, F0        ; copy a into F3 for SIMD MUL
            DADDIU R4, Rx, #512  ; last address to load
    Loop:   L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
            MUL.4D F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
            L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
            ADD.4D F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
            S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
            DADDIU Rx, Rx, #32   ; increment index to X
            DADDIU Ry, Ry, #32   ; increment index to Y
            DSUBU  R20, R4, Rx   ; compute bound
            BNEZ   R20, Loop     ; check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only accesses contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by the cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec
[Figure: roofline plots; a sloped memory-bound region meets a flat compute-bound ceiling]
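The standard roofline formula behind that plot is attainable GFLOPs/sec = min(peak memory bandwidth × arithmetic intensity, peak floating-point performance); as a short C sketch (function name and the example machine are mine):

    #include <math.h>

    /* Roofline: bw in GB/s, ai in FLOPs/byte, peak in GFLOP/s. */
    double roofline_gflops(double bw, double ai, double peak) {
        return fmin(bw * ai, peak);  /* memory-bound vs. compute-bound */
    }

    /* e.g. roofline_gflops(16.0, 1.0, 64.0) == 16.0: a hypothetical
       machine stays memory-bound until ai reaches 4 FLOPs/byte. */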

History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• The CPU is the host, the GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA programming model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– One SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks (see the sketch after this slide)
– A block is analogous to a strip-mined vector loop with a vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology
• Threads of SIMD instructions
– Each has its own PC
– The thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors

Example
• An NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU
• Global memory: 1280 MB; L2 cache: 640 KB
• Per SM (15 SMs, SM 0 ... SM 14): 32,768 registers, 32 cores, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, 8 KB constant cache
• Up to 1536 threads per SM
[Figure: GTX570 memory hierarchy]

GPU Threads in SM (GTX570)
• The 32 threads within a block work collectively
– Memory access optimization, latency hiding

Matrix Multiplication
[Figure: matrix multiplication decomposition on the GPU]

Programming the GPU
[Figure: CUDA source for the matrix multiplication kernel]

Matrix Multiplication
• For a 4096x4096 matrix multiplication
– Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth

Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block

CUDA Program Example
• Invoke DAXPY:
    daxpy(n, 2.0, x, y);
• DAXPY in C:
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
• Invoke DAXPY with 256 threads per thread block:
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
• DAXPY in CUDA:
    __device__
    void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }
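For completeness, a hedged host-side sketch of how this kernel could be set up and launched end to end (not from the lecture; note that production CUDA requires a kernel launched with <<<...>>> to be declared __global__, while the slide follows the textbook's listing):

    #include <cuda_runtime.h>

    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main(void) {
        int n = 8192;
        size_t bytes = n * sizeof(double);
        double *x, *y;                          /* device pointers */
        cudaMalloc((void **)&x, bytes);         /* allocate GPU memory */
        cudaMalloc((void **)&y, bytes);
        /* ... fill x and y with cudaMemcpy(..., cudaMemcpyHostToDevice) ... */
        int nblocks = (n + 255) / 256;          /* 256 threads per block */
        daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
        cudaDeviceSynchronize();                /* wait for the kernel */
        /* ... copy results back with cudaMemcpyDeviceToHost ... */
        cudaFree(x); cudaFree(y);
        return 0;
    }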

NVIDIA Instruction Set Architecture
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):
    shl.s32       R8, blockIdx, 9    ; Thread Block ID × Block size (512, or 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; product in RD0 = RD0 × RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– A branch synchronization stack
• Entries consist of masks for each SIMD lane
• I.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on a divergent branch
– ... and when paths converge
• Act as barriers
• Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

            ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
            setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
            @!P1, bra ELSE1, *Push      ; push old mask, set new mask bits
                                        ; if P1 false, go to ELSE1
            ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
            sub.f64       RD0, RD0, RD2 ; difference in RD0
            st.global.f64 [X+R8], RD0   ; X[i] = RD0
            @P1, bra ENDIF1, *Comp      ; complement mask bits
                                        ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
            st.global.f64 [X+R8], RD0   ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop    ; pop to restore old mask

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 21: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Code Vectorization

CA-Lec8 cwliutwinseenctuedutw 21

for (i=0 i lt N i++)C[i] = A[i] + B[i]

load

load

add

store

load

load

add

store

Iter 1

Iter 2

Scalar Sequential Code

Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis

Vector Instruction

load

load

add

store

load

load

add

store

Iter 1 Iter 2

Vectorized Code

Tim

e

Vector Execution Timebull Execution time depends on three factors

ndash Length of operand vectorsndash Structural hazardsndash Data dependencies

bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length

bull Convoyndash Set of vector instructions that could potentially execute together

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

22

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – only accesses contiguous data
• No efficient scatter/gather accesses
  – significant penalty for unaligned memory access
  – may need to write an entire vector register
• Limitations on data access patterns
  – limited by the cache line, up to 128-256b
• Conditional execution
  – register renaming does not work well with masked execution
  – always need to write the whole register
  – difficult to know when to indicate exceptions
• Register pressure
  – need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea
  – plot peak floating-point throughput as a function of arithmetic intensity
  – ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – floating-point operations per byte read

Examples
• Attainable GFLOPs/sec (figure: roofline plots for sample machines)
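The quantity being plotted is bounded by the two ceilings of the model above (this restates the roofline formula; the slide itself shows only the plots):

  Attainable GFLOPs/sec = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)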

History of GPUs
• Early video cards
  – frame buffer memory with address generation for video output
• 3D graphics processing
  – originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – processors oriented to 3D graphics tasks
  – vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea
  – heterogeneous execution model
    • CPU is the host, GPU is the device
  – develop a C-like programming language for the GPU
  – unify all forms of GPU parallelism as the CUDA thread
  – programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism, i.e., how to craft efficient parallel algorithms, rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines
  – works well with data-level parallel problems
  – scatter-gather transfers from memory into local store
  – mask registers
  – large register files
• Differences
  – no scalar processor (scalar code is integrated on the host)
  – uses multithreading to hide memory latency
  – has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • a thread is associated with each data element
    • threads are organized into blocks
    • blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
  – code that works over all elements is the grid
  – thread blocks break this down into manageable sizes
    • 512 threads per block
  – a SIMD instruction executes 32 elements at a time
  – thus grid size = 16 blocks
  – a block is analogous to a strip-mined vector loop with vector length of 32
  – a block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
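A minimal CUDA sketch of this decomposition (the kernel name and pointers are illustrative, not from the lecture):

  __global__ void mul8192(const float *a, const float *b, float *c) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
      if (i < 8192)
          c[i] = a[i] * b[i];
  }

  // Host side: 8192 elements / 512 threads per block = grid of 16 blocks.
  // mul8192<<<16, 512>>>(d_a, d_b, d_c);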

Terminology
• Threads of SIMD instructions
  – each has its own PC
  – the thread scheduler uses a scoreboard to dispatch
  – no data dependencies between threads
  – keeps track of up to 48 threads of SIMD instructions
    • hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
  – divided into lanes
  – each SIMD thread is limited to 64 registers
  – a SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
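A quick consistency check on these numbers (arithmetic added here, not on the slide): 16 physical lanes × 2048 registers per lane = 32,768 registers in total, and the per-thread cap of 64 vector registers × 32 32-bit elements × 4 bytes = 8 KB of register state per SIMD thread.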

GTX570 GPU (figure: block diagram)
• Global memory: 1280MB; L2 cache: 640KB
• 15 SMs (SM 0 – SM 14), each with: 32768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache, 32 cores
• Up to 1536 threads/SM

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – memory access optimization, latency hiding

Matrix Multiplication (figure)

Programming the GPU (figure)

Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means that many matrix cells can be calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
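A sketch of such a kernel: each thread computes one element of C, and each block stages TILE×TILE sub-matrices of A and B in shared memory (names and the fixed tile size are illustrative; assumes n is a multiple of TILE):

  #define TILE 16

  __global__ void matmul(const float *A, const float *B, float *C, int n) {
      __shared__ float As[TILE][TILE];  // sub-matrix of A buffered in shared memory
      __shared__ float Bs[TILE][TILE];  // sub-matrix of B buffered in shared memory
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float sum = 0.0f;
      for (int t = 0; t < n / TILE; t++) {
          As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
          Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
          __syncthreads();              // whole tile loaded before anyone uses it
          for (int k = 0; k < TILE; k++)
              sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();              // done with this tile before it is overwritten
      }
      C[row * n + col] = sum;           // one thread writes one element of C
  }

  // Launched as: matmul<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n);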

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU (the device); variables declared in such functions are allocated to GPU memory
  – __host__ => system processor (the host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example
• Invoke DAXPY:

  daxpy(n, 2.0, x, y);

• DAXPY in C:

  void daxpy(int n, double a, double *x, double *y) {
      for (int i = 0; i < n; i++)
          y[i] = a*x[i] + y[i];
  }

• Invoke DAXPY with 256 threads per thread block:

  __host__
  int nblocks = (n + 255) / 256;
  daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

• DAXPY in CUDA (the kernel must be __global__ to be launchable with <<<...>>>):

  __global__
  void daxpy(int n, double a, double *x, double *y) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a*x[i] + y[i];
  }

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):

  shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
  add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
  ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
  ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
  mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – branch synchronization stack
    • entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – instruction markers to manage when a branch diverges into multiple execution paths
    • push on divergent branch
  – ...and when paths converge
    • act as barriers
    • pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example

  if (X[i] != 0)
      X[i] = X[i] - Y[i];
  else
      X[i] = Z[i];

  ld.global.f64  RD0, [X+R8]          ; RD0 = X[i]
  setp.neq.s32   P1, RD0, #0          ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push              ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
  ld.global.f64  RD2, [Y+R8]          ; RD2 = Y[i]
  sub.f64        RD0, RD0, RD2        ; Difference in RD0
  st.global.f64  [X+R8], RD0          ; X[i] = RD0
  @P1, bra ENDIF1, *Comp              ; complement mask bits
                                      ; if P1 true, go to ENDIF1
  ELSE1:  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
          st.global.f64 [X+R8], RD0   ; X[i] = RD0
  ENDIF1: <next instruction>, *Pop    ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "private memory"
  – contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – shared by the SIMD lanes / threads within a block
• Memory shared by the SIMD processors is GPU Memory
  – the host can read and write GPU memory
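In CUDA source these three levels map onto per-thread local variables, __shared__ arrays, and __device__ globals; note the naming clash: the lecture's "local memory" is CUDA's "shared memory", while the lecture's "private memory" is what CUDA calls "local memory". A small illustrative sketch (identifiers are made up; launch with 256 threads per block):

  __device__ float gpu_buffer[1 << 20];  // GPU memory: visible to every SIMD
                                         // processor; the host can copy it in/out

  __global__ void scale_kernel(void) {
      __shared__ float tile[256];        // per-block memory, shared by the
                                         // SIMD lanes / threads of one block
      float tmp;                         // per-thread private; spills land in a
                                         // private section of off-chip DRAM
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      tile[threadIdx.x] = gpu_buffer[i];
      __syncthreads();                   // tile is complete before anyone reads it
      tmp = tile[threadIdx.x] * 2.0f;
      gpu_buffer[i] = tmp;
  }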

Fermi Architecture Innovations
• Each SIMD processor has:
  – two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc. (figure)

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:

  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }

• S1 and S2 each use a value computed in the previous iteration (A[i] by S1, B[i] by S2)
  – loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – no loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
  – a sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – circular: S1 depends on S2 and S2 depends on S1
  – not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – the absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:

  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
  }

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:

  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

  for (i=0; i<100; i=i+1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
  }

• The second reference to A need not be translated to a load instruction
  – the two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:

  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – some vector computers have special support for executing recurrences
  – it may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – a one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – the index of a multi-dimensional array is affine if the index in each dimension is affine
  – non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding dependencies: Example
• Assume:
  – load an array element with index c*i + d and store to index a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. there are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • if a loop-carried dependence exists, then GCD(c,a) must divide |d-b|

Example

  for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible
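A direct transcription of the test into C (a sketch of the check alone, not a full dependence analyzer):

  #include <stdio.h>

  /* Greatest common divisor via Euclid's algorithm. */
  static int gcd(int u, int v) {
      while (v != 0) { int t = u % v; u = v; v = t; }
      return u < 0 ? -u : u;
  }

  /* GCD test: a store to a*i+b and a load from c*i+d can carry a dependence
   * only if GCD(c,a) divides |d-b|. The test can rule a dependence out, but
   * cannot confirm one (it ignores the loop bounds; see the next slide). */
  static int gcd_test_may_depend(int a, int b, int c, int d) {
      int g = gcd(c, a);
      int diff = d - b;
      if (diff < 0) diff = -diff;
      return g == 0 ? diff == 0 : diff % g == 0;
  }

  int main(void) {
      /* X[2*i+3] = X[2*i] * 5.0: a=2, b=3, c=2, d=0 -> 2 does not divide 3 */
      printf("%s\n", gcd_test_may_depend(2, 3, 2, 0) ? "may depend" : "no dependence");
      return 0;
  }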

Remarks
• The GCD test is sufficient but not necessary
  – the GCD test does not consider the loop bounds
  – there are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding dependencies
• Example 2:

  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;   /* S1 */
      X[i] = X[i] + c;   /* S2 */
      Z[i] = Y[i] + c;   /* S3 */
      Y[i] = c - Y[i];   /* S4 */
  }

• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:

  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }

• After:

  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }

Eliminating Recurrence Dependence
• Recurrence example: a dot product

  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];

  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];

  – the first loop is called scalar expansion (scalar ===> vector) and is parallel
  – the second loop is called a reduction; it sums up the elements of the vector and is not parallel

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:

  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];

  – assume p ranges from 0 to 9

Multithreading and Vector Summary
• Explicitly parallel execution (DLP or TLP) is the next step in performance
• Coarse-grained vs. fine-grained multithreading
  – switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an out-of-order superscalar microarchitecture
  – instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – if code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth


ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 23: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

load

Vector Instruction ParallelismCan overlap execution of multiple vector instructions

ndash example machine has 32 elements per vector register and 8 lanes

loadmul

mul

add

add

Load Unit Multiply Unit Add Unit

time

Complete 24 operationscycle while issuing 1 short instructioncycle

CA-Lec8 cwliutwinseenctuedutw 23

Convoy

bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys

bull However sequences with RAW hazards can be in the same convey via chaining

CA-Lec8 cwliutwinseenctuedutw 24

Vector Chainingbull Vector version of register bypassing

ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available

Memory

V1

Load Unit

Mult

V2 V3

Chain

Add

V4 V5

Chain

LV v1MULV v3 v1v2ADDV v5 v3 v4

CA-Lec8 cwliutwinseenctuedutw 25

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressing

• Load/store operations move groups of data between registers and memory
• Three types of addressing
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases number of programs that vectorize

Interleaved Memory Layout

[Figure: a vector processor connected to 8 unpipelined DRAM banks; bank i holds the words with Addr mod 8 = i]

• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for vector operation is latency of single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for: 2, 4

Handling Multidimensional Arrays in Vector Architectures

• Consider:
  for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
      A[i][j] = 0.0;
      for (k = 0; k < 100; k=k+1)
        A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride
• In the example (see the sketch below):
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit stride data
  – Blocking techniques help non-unit stride data
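To make the two strides concrete, a small C sketch for the row-major layout assumed in the example (the function name is illustrative):

  /* one element of A = B x D: consecutive B operands are 1 doubleword
     apart (unit stride), while consecutive D operands are one row,
     i.e. 100 doublewords, apart (non-unit stride) */
  double cell(double B[100][100], double D[100][100], int i, int j) {
      double dot = 0.0;
      for (int k = 0; k < 100; k++)
          dot += B[i][k] * D[k][j];   /* B: stride 1; D: stride 100 */
      return dot;
  }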

How to get full bandwidth for unit stride?

• Memory system must sustain (# lanes × word) per clock
• # memory banks > memory latency to avoid stalls
• If desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good to support more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently

Memory Bank Conflicts

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
      x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk above conflicts on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – LCM(stride, #banks) / stride (= #banks / GCD(stride, #banks)) < bank busy time (see the sketch below)
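A small C sketch of this condition (bank count, stride, and busy time are machine parameters; the values in the comments come from the example that follows):

  /* greatest common divisor, used to evaluate the conflict condition */
  static unsigned gcd_u(unsigned a, unsigned b) {
      while (b != 0) { unsigned t = a % b; a = b; b = t; }
      return a;
  }

  /* the same bank is revisited every nbanks/GCD(stride,nbanks) accesses
     (= LCM(stride,nbanks)/stride); a stall occurs if that interval is
     shorter than the bank busy time */
  int has_bank_conflict(unsigned nbanks, unsigned stride, unsigned busy) {
      return nbanks / gcd_u(stride, nbanks) < busy;
  }

  /* has_bank_conflict(8, 1, 6)  -> 0: 8 >= 6, no stall */
  /* has_bank_conflict(8, 32, 6) -> 1: 1 < 6, stalls    */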

Example

• 8 memory banks. Bank busy time: 6 cycles. Total memory latency: 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?

• Solution
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
    – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
    – the total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures

• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
  for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered

Scatter-Gather

• Consider:
  for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors (see the C version below):
  LV Vk, Rk           ;load K
  LVI Va, (Ra+Vk)     ;load A[K[]]
  LV Vm, Rm           ;load M
  LVI Vc, (Rc+Vm)     ;load C[M[]]
  ADDVV.D Va, Va, Vc  ;add them
  SVI (Ra+Vk), Va     ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
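The same sequence written as explicit C loops, with temporary arrays standing in for the vector registers Va and Vc (a sketch, not compiler output):

  /* gather A[K[i]] and C[M[i]], add elementwise, scatter back through K */
  void sparse_add(double *A, const double *C, const int *K, const int *M,
                  double *Va, double *Vc, int n) {
      for (int i = 0; i < n; i++) Va[i] = A[K[i]];  /* LVI: gather  */
      for (int i = 0; i < n; i++) Vc[i] = C[M[i]];  /* LVI: gather  */
      for (int i = 0; i < n; i++) Va[i] += Vc[i];   /* ADDVV.D      */
      for (int i = 0; i < n; i++) A[K[i]] = Va[i];  /* SVI: scatter */
  }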

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures

• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP Benchmarks [benchmark table not preserved in this extraction]

Multimedia Extensions

• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
  – use in drivers or added to library routines; no compiler

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
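For flavor, a minimal SSE2 intrinsics sketch of DAXPY (two doubles per 128-bit operation; assumes n is even and compilation with SSE2 enabled):

  #include <emmintrin.h>

  /* y[i] = a*x[i] + y[i], two doubles per SSE2 operation */
  void daxpy_sse2(int n, double a, const double *x, double *y) {
      __m128d va = _mm_set1_pd(a);               /* broadcast a        */
      for (int i = 0; i < n; i += 2) {
          __m128d vx = _mm_loadu_pd(x + i);      /* load x[i], x[i+1]  */
          __m128d vy = _mm_loadu_pd(y + i);
          vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
          _mm_storeu_pd(y + i, vy);              /* store y[i], y[i+1] */
      }
  }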

SIMD Implementations: IBM

• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture

Example SIMD Code

• Example: DXPY
  L.D F0,a           ;load scalar a
  MOV F1, F0         ;copy a into F1 for SIMD MUL
  MOV F2, F0         ;copy a into F2 for SIMD MUL
  MOV F3, F0         ;copy a into F3 for SIMD MUL
  DADDIU R4,Rx,#512  ;last address to load
Loop:
  L.4D F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
  MUL.4D F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
  L.4D F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
  ADD.4D F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
  S.4D 0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
  DADDIU Rx,Rx,#32   ;increment index to X
  DADDIU Ry,Ry,#32   ;increment index to Y
  DSUBU R20,R4,Rx    ;compute bound
  BNEZ R20,Loop      ;check if done

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model

• Basic idea
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

Examples

• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

[Figure: roofline plots for example machines]

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model

• CUDA's design goals
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU

• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example

• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU

[Figure: GTX570 memory hierarchy]
– Global Memory: 1280MB; L2 Cache: 640KB
– Per SM (SM 0 … SM 14): 32768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache, 32 cores
– Up to 1536 threads/SM

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication

[Figure: GPU matrix multiplication example]

Programming the GPU

[Figure: CUDA source code example]

Matrix Multiplication

• For a 4096×4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
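A minimal CUDA sketch of the one-thread-per-cell scheme described above, without the shared-memory buffering, which would be layered on top (N and all names are assumptions for illustration):

  #define N 4096

  /* each thread computes one cell C[row][col] of C = A * B */
  __global__ void matmul(const float *A, const float *B, float *C) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row < N && col < N) {
          float sum = 0.0f;
          for (int k = 0; k < N; k++)
              sum += A[row * N + k] * B[k * N + col];
          C[row * N + col] = sum;
      }
  }

  /* launch example: dim3 threads(16, 16); dim3 blocks(N/16, N/16);
     matmul<<<blocks, threads>>>(dA, dB, dC); */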

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU Device; variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (__global__, so the kernel can be launched from the host)
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
  shl.s32 R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
  add.s32 R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
  ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
  ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
  mul.f64 RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64 RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64 [Y+R8], RD0  ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • I.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

  ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
  setp.neq.s32 P1, RD0, #0   ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push     ; Push old mask, set new mask bits
                             ; if P1 false, go to ELSE1
  ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
  sub.f64 RD0, RD0, RD2      ; Difference in RD0
  st.global.f64 [X+R8], RD0  ; X[i] = RD0
  @P1, bra ENDIF1, *Comp     ; complement mask bits
                             ; if P1 true, go to ENDIF1
ELSE1:
  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
  st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1:
  <next instruction>, *Pop   ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations

• Each SIMD processor has
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc.

[Figure: block diagram of a Fermi multithreaded SIMD processor]

Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism

• Example 2:
  for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 use values computed by S1 in a previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – No loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)

• Example 3:
  for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in a previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)

• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

  for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
  }
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i=1; i<100; i=i+1) {
    Y[i] = Y[i-1] + Y[i];
  }
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding dependencies: Example

• Assume:
  – Load an array element with index c × i + d and store to a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
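A compact C sketch of the test, with a, b, c, d as defined above (function names are illustrative; the test is sufficient but not necessary, as the Remarks below explain):

  static int igcd(int a, int b) {           /* gcd of |a| and |b| */
      if (a < 0) a = -a;
      if (b < 0) b = -b;
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* returns 1 if a loop-carried dependence is possible:
     a dependence requires GCD(c,a) to divide (d-b) */
  int gcd_test(int a, int b, int c, int d) {
      return (d - b) % igcd(c, a) == 0;
  }

  /* X[2*i+3] = X[2*i] * 5.0: a=2, b=3, c=2, d=0
     -> GCD(2,2)=2, d-b=-3; 2 does not divide 3 => no dependence */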

Example

  for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
  }
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |d-b|=3
  3. Since 2 does not divide 3, no dependence is possible

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding dependencies

• Example 2:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
  }
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
  }

Eliminating Recurrence Dependence

• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
  for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
  – This is called scalar expansion (scalar ===> vector); this loop is parallel
  for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
  – This is called a reduction; it sums up the elements of the vector and is not parallel

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
  for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

Multithreading and Vector Summary

• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth


M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw


Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available

[Figure: memory feeds a Load Unit writing V1; the Mult unit reads V1 and V2 and writes V3 as elements arrive (chain); the Add unit reads V3 and V4 and writes V5 (chain)]

LV v1
MULV v3, v1, v2
ADDV v5, v3, v4

Vector Chaining Advantage
• With chaining, can start a dependent instruction as soon as the first result appears

[Timing diagram: Load → Mul → Add issue timing, with and without chaining]

• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction

Chimes
• Chime: unit of time to execute one convoy
– m convoys execute in m chimes
– For vector length of n, requires m × n clock cycles

Example
LV V1,Rx          ;load vector X
MULVS.D V2,V1,F0  ;vector-scalar multiply
LV V3,Ry          ;load vector Y
ADDVV.D V4,V2,V3  ;add two vectors
SV Ry,V4          ;store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles

Vector Strip-mining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers: "strip-mining".

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

[Figure: A, B, C each split into a remainder piece plus full 64-element pieces]

      ANDI R1, N, 63    ; N mod 64
      MTC1 VLR, R1      ; Do remainder
loop: LV V1, RA
      DSLL R2, R1, 3    ; Multiply by 8
      DADDU RA, RA, R2  ; Bump pointer
      LV V2, RB
      DADDU RB, RB, R2
      ADDV.D V3, V1, V2
      SV V3, RC
      DADDU RC, RC, R2
      DSUBU N, N, R1    ; Subtract elements
      LI R1, 64
      MTC1 VLR, R1      ; Reset full length
      BGTZ N, loop      ; Any more to do?

Vector Length Register
• Handles loops whose length is not equal to 64 (the maximum)
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip mining:

low = 0;
VL = (n % MVL);                        /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop */
    for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
        Y[i] = a * X[i] + Y[i];        /* main operation */
    low = low + VL;                    /* start of next vector */
    VL = MVL;                          /* reset the length to maximum vector length */
}

Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA

Vector-Mask Control
• Simple implementation: execute all N operations, and turn off result writeback according to the mask (write disable on masked-off elements)
• Density-time implementation: scan the mask vector and only execute elements with non-zero masks
[Figure: both implementations shown for mask M[0..7] = 0,1,0,0,1,1,0,1 computing C[i] = A[i] + B[i] through a write data port]

Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on the compiler to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
– Invisible to SW
• Both GPU and vector processor spend time on masking

Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot normally be run in vector mode
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV V1,Rx          ;load vector X into V1
LV V2,Ry          ;load vector Y
L.D F0,#0         ;load FP zero into F0
SNEVS.D V1,F0     ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2  ;subtract under vector mask
SV Rx,V1          ;store the result in X
• GFLOPS rate decreases (masked-off elements still take execution time)

Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
– population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: for mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1],A[4],A[5],A[7] to the front of the destination; expand scatters them back to positions 1,4,5,7, leaving the other elements untouched]
Used for density-time conditionals and also for general selection operations
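
A minimal C sketch of the two operations (hypothetical names; a scalar loop standing in for the vector hardware):

/* Compress: pack elements of v whose mask bit is set to the front of dst. */
/* Returns the packed length (the population count of the mask).           */
int compress(const double *v, const int *mask, double *dst, int vl) {
    int n = 0;
    for (int i = 0; i < vl; i++)
        if (mask[i]) dst[n++] = v[i];
    return n;
}

/* Expand: inverse operation; consume packed src wherever the mask is set. */
void expand(const double *src, const int *mask, double *dst, int vl) {
    int n = 0;
    for (int i = 0; i < vl; i++)
        if (mask[i]) dst[i] = src[n++];
}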

Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
– Spread accesses across multiple banks:
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. One SRAM access takes 15/2.167 = 6.92 ≈ 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks for full memory bandwidth

The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)

Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– Unit stride
• Contiguous block of information in memory
• Fastest: always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize the memory system for all possible strides
• A prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases the number of programs that vectorize

Interleaved Memory Layout
[Figure: vector processor connected to 8 unpipelined DRAM banks; bank i holds the words with Addr mod 8 = i]
• Great for unit stride:
– Contiguous elements sit in different DRAMs
– Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
– The above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4

Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it has logically adjacent elements

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit stride data
– Blocking techniques help non-unit stride data

How to get full bandwidth for unit stride?
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– can read the paper on how to do a prime number of banks efficiently

Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
• SW: loop interchange, or declaring the array with a non-power-of-2 dimension ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number is easy if there are 2^N words per bank
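
A minimal sketch of the array-padding fix mentioned above (assuming the 128-bank layout): one extra column makes the row length 513 words, which is not a multiple of the bank count, so successive iterations of the i loop land in different banks.

int x[256][513];                 /* 513 = 512 data columns + 1 pad column */
for (j = 0; j < 512; j = j+1)    /* the pad column is never touched */
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];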

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
#banks / LCM(stride, #banks) < bank busy time

Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for an initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?

• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, this is the worst possible stride
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element

Handling Sparse Matrices in Vector Architecture
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed or gathered

Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk and Rm contain the starting addresses of the vectors
• Use index vectors:
LV Vk, Rk          ;load K
LVI Va, (Ra+Vk)    ;load A[K[]]
LV Vm, Rm          ;load M
LVI Vc, (Rc+Vm)    ;load C[M[]]
ADDVV.D Va, Va, Vc ;add them
SVI (Ra+Vk), Va    ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes allow the same throughput at a slower clock rate
– Scalable if elements are independent
– If there is dependency:
• One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [table omitted]

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (the 1st since the 386)
– similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claimed overall speedup: 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– used in drivers or added to library routines; no compiler support

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optional signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if the number is too large

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
– Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of a vector architecture

Example SIMD Code
• Example: DXPY
      L.D F0,a          ;load scalar a
      MOV F1, F0        ;copy a into F1 for SIMD MUL
      MOV F2, F0        ;copy a into F2 for SIMD MUL
      MOV F3, F0        ;copy a into F3 for SIMD MUL
      DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0[Rx]     ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4,F4,F0   ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D F8,0[Ry]     ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8,F8,F4   ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D 0[Ry],F8     ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx,Rx,#32  ;increment index to X
      DADDIU Ry,Ry,#32  ;increment index to Y
      DSUBU R20,R4,Rx   ;compute bound
      BNEZ R20,Loop     ;check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec [roofline plots omitted]
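
The bound plotted in those examples is the standard roofline formula:
Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
The diagonal part of the roof is the memory-bandwidth limit, the flat part is the compute limit, and the ridge point where they meet gives the arithmetic intensity needed to reach peak FP throughput.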

History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++,
• focus on the important issues of parallelism, how to craft efficient parallel algorithms, rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– A block is analogous to a strip-mined vector loop with vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU
• Global memory: 1280MB; L2 cache: 640KB
• 15 SMs (SM 0 ... SM 14), each with:
– 32 cores
– 32,768 registers
– Shared memory: 48KB
– L1 cache: 16KB
– Texture cache: 8KB; constant cache: 8KB
– Up to 1536 threads/SM

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
• Memory access optimization, latency hiding

Matrix Multiplication
[figure omitted]

Programming the GPU
[figure omitted]

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
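
A minimal CUDA sketch of this scheme (a hypothetical kernel, not the lecture's exact code; assumes n is a multiple of the tile size): each thread computes one cell of C, staging TILE×TILE sub-matrices of A and B in shared memory.

#define TILE 16
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;  /* this thread's cell of C */
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {        /* march tiles along the dot product */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                        /* whole tile loaded before use */
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        /* done with this tile */
    }
    C[row * n + col] = sum;                     /* one thread writes one cell */
}

A launch for the 4096x4096 case would look like: matmul<<<dim3(4096/TILE, 4096/TILE), dim3(TILE, TILE)>>>(A, B, C, 4096);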

Programming the GPU
• Distinguishing the execution place of functions:
__device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
__host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
blockIdx: block identifier
threadIdx: thread identifier within a block
blockDim: threads per block

CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32  R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
add.s32  R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]  ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]  ; RD2 = Y[i]
mul.f64  RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64  RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0  ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• I.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else X[i] = Z[i];

ld.global.f64  RD0, [X+R8]      ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0      ; P1 is predicate register 1
@!P1, bra ELSE1, *Push          ; Push old mask, set new mask bits
                                ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]      ; RD2 = Y[i]
sub.f64  RD0, RD0, RD2          ; Difference in RD0
st.global.f64  [X+R8], RD0      ; X[i] = RD0
@P1, bra ENDIF1, *Comp          ; complement mask bits
                                ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64  [X+R8], RD0      ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc.
[figure omitted]

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration (A[i] and B[i])
– Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
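
A minimal sketch of what the optimized code effectively does (illustrative only): the value stored to A[i] is kept in a register and reused instead of being reloaded.

for (i=0; i<100; i=i+1) {
    double t = B[i] + C[i];  /* compute once, hold in a register */
    A[i] = t;                /* store to memory */
    D[i] = t * E[i];         /* reuse the register; no reload of A[i] */
}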

Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example
• Assume:
– Load an array element with index c × i + d, and store to a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
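
A small C sketch of the test (hypothetical helper names, following the slide's a, b, c, d convention; assumes a and c are not both zero):

#include <stdlib.h>

/* Greatest common divisor via Euclid's algorithm. */
static int gcd(int u, int v)
{
    while (v != 0) { int t = u % v; u = v; v = t; }
    return u;
}

/* Store to a*i + b, load from c*i + d: a loop-carried dependence
   can exist only if GCD(c,a) divides (d - b).                     */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(abs(c), abs(a));
    return abs(d - b) % g == 0;   /* 1 = dependence possible, 0 = ruled out */
}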

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2, |b-d| = 3
3. Since 2 does not divide 3, no dependence is possible

Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)  /* scalar expansion: scalar ==> vector; parallel */
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)  /* reduction: sums up the elements of the vector; not parallel */
    finalsum = finalsum + sum[i];

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
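
A C sketch of the complete scheme (hypothetical names; assumes sum[] holds the 10,000 partial products from the previous slide): each of the ten processors runs the per-slice loop with its own p, then one final pass combines the ten partial sums.

/* Per-processor partial reduction: processor p sums its 1000-element slice. */
for (i = 999; i >= 0; i = i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];

/* Final combining step, run on one processor after all ten finish. */
double total = 0.0;
for (p = 0; p < 10; p = p+1)
    total = total + finalsum[p];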

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

Page 26: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Chaining Advantage

CA-Lec8 cwliutwinseenctuedutw 26

bull With chaining can start dependent instruction as soon as first result appears

LoadMul

Add

LoadMul

AddTime

bull Without chaining must wait for last element of result to be written before starting dependent instruction

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication

• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the tiled-kernel sketch below).
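A sketch of such a tiled kernel, assuming square matrices of width N (an exact multiple of the tile size) and a hypothetical TILE of 16; this illustrates the shared-memory buffering idea above and is not the lecture's exact code:

#define TILE 16  // hypothetical tile size

// Each thread computes one element of C; tiles of A and B are staged
// through shared memory so each global element is read N/TILE times, not N.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];  // sub-matrix buffer for A
    __shared__ float Bs[TILE][TILE];  // sub-matrix buffer for B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {  // walk the tiles along k
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                  // wait until the tile is loaded
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                  // wait before overwriting the tile
    }
    C[row * N + col] = sum;
}

// Launch (N = 4096): dim3 block(TILE, TILE); dim3 grid(N/TILE, N/TILE);
// matmul<<<grid, block>>>(A, B, C, N);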

Programming the GPU

• Distinguishing execution place of functions:
– __device__ or __global__ => GPU (device); variables declared in them are allocated to the GPU memory
– __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: number of threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (a kernel launched with <<<>>> must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (DAXPY body):

shl.s32       R8, blockIdx, 9    // Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  // R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        // RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        // RD2 = Y[i]
mul.f64       RD0, RD0, RD4      // Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      // Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        // Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         // RD0 = X[i]
setp.neq.s32  P1, RD0, #0         // P1 is predicate register 1
@!P1, bra     ELSE1, *Push        // Push old mask, set new mask bits
                                  // if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         // RD2 = Y[i]
sub.f64       RD0, RD0, RD2       // Difference in RD0
st.global.f64 [X+R8], RD0         // X[i] = RD0
@P1, bra      ENDIF1, *Comp       // complement mask bits
                                  // if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] // RD0 = Z[i]
        st.global.f64 [X+R8], RD0 // X[i] = RD0
ENDIF1: <next instruction>, *Pop  // pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory

Fermi Architecture Innovations

• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc. (figure)

Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop-Level Parallelism

• Example 2:

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration (A[i] and B[i])
– Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration (A[i+1])
– No loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)

• Example 3:

for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses the value computed by S2 in the previous iteration (B[i]), but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)

• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];     /* S2 */
    A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize (see the sketch below)
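A short sketch of what this optimization buys: the compiler can keep the computed value in a register instead of reloading A[i] (illustrative C fragment using the arrays above, not from the lecture):

for (i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];  // compute once, keep in a register
    A[i] = t;                // single store to A[i]
    D[i] = t * E[i];         // reuse the register; no reload of A[i]
}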

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependences: Example

• Assume:
– Load of an array element with index c*i + d, and store to index a*i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are two iteration indices j and k such that m <= j <= n and m <= k <= n
2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a,c)=2, |d-b|=3
3. Since 2 does not divide 3, no dependence is possible
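A small C sketch of the test applied to this example; the function name gcd_test is ours, not from the lecture, and it checks only the divisibility condition (it ignores the loop bounds, as the next slide remarks):

#include <stdio.h>
#include <stdlib.h>

// greatest common divisor (Euclid's algorithm)
static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

// GCD test: a dependence between store X[a*i+b] and load X[c*i+d]
// is possible only if GCD(c,a) divides (d-b).
static int gcd_test(int a, int b, int c, int d) {
    return abs(d - b) % gcd(c, a) == 0;  // 1 = dependence possible
}

int main(void) {
    // X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0
    printf("%s\n", gcd_test(2, 3, 2, 0) ? "dependence possible"
                                        : "no dependence possible");
    return 0;  // prints "no dependence possible": 2 does not divide 3
}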

Remarks

• The GCD test is sufficient but not necessary: a failed test proves independence, but a successful test does not prove a dependence
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependences

• Example 2:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}

• True dependences: S1->S3 and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:

for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence

• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];          // scalar expansion: scalar ==> vector, parallel

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];  // reduction: sums up the elements of the vector, not parallel

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];

– Assume p ranges from 0 to 9
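A hedged C sketch of the full two-phase reduction: the per-processor loop above, followed by a final pass over the ten partial sums. Plain C with the parallelism only implied by independence; this is an illustration, not lecture code:

double sum[10000];    // element-wise products from scalar expansion
double finalsum[10];  // one partial sum per processor
double total = 0.0;

// Phase 1: each processor p (0..9) reduces its own 1000-element slice.
// The ten p-iterations are independent, so they can run in parallel.
for (int p = 0; p < 10; p++) {
    finalsum[p] = 0.0;
    for (int i = 999; i >= 0; i = i - 1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];
}

// Phase 2: combine the ten partial sums (a short serial reduction).
for (int p = 0; p < 10; p++)
    total = total + finalsum[p];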

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

Page 27: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Chimes

bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles

CA-Lec8 cwliutwinseenctuedutw 27

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 28: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum

Convoys1 LV MULVSD2 LV ADDVVD3 SV

3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles

CA-Lec8 cwliutwinseenctuedutw

Vector Architectures

28

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations: IA32/AMD64
• Intel MMX (1996):
  – repurposed the 64-bit floating-point registers
  – eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999):
  – separate 128-bit registers
  – eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007):
  – double-precision floating-point arithmetic
• Advanced Vector Extensions (2010):
  – 256-bit registers
  – four 64-bit integer/fp ops
  – extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998):
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – data rearrangement
• Cell SPE (PS3):
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360):
  – extension to 128 registers
• VSX (2009):
  – 1 or 2 64b FPUs, 4 32b FPUs
  – integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene):
  – four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size.
• Cost little to add to the standard arithmetic unit.
• Easy to implement.
• Need smaller memory bandwidth than a vector architecture.
• Data transfers are separate and aligned in memory.
  – Vector: a single instruction may make 64 memory accesses, so a page fault in the middle of the vector is likely.
• Use much smaller register space.
• Fewer operands.
• No need for the sophisticated mechanisms of vector architectures.

Example SIMD Code
• Example: DXPY

            LD     F0, a         ; load scalar a
            MOV    F1, F0        ; copy a into F1 for SIMD MUL
            MOV    F2, F0        ; copy a into F2 for SIMD MUL
            MOV    F3, F0        ; copy a into F3 for SIMD MUL
            DADDIU R4, Rx, #512  ; last address to load
      Loop: L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
            MUL.4D F4, F4, F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
            L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
            ADD.4D F8, F8, F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
            S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
            DADDIU Rx, Rx, #32   ; increment index to X
            DADDIU Ry, Ry, #32   ; increment index to Y
            DSUBU  R20, R4, Rx   ; compute bound
            BNEZ   R20, Loop     ; check if done
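
For comparison, a sketch of the same DAXPY idea written with C compiler intrinsics (SSE2 holds two doubles per 128-bit register, so the loop is unrolled by two to cover four elements per iteration; the function name and the assumption that n is a multiple of 4 are illustrative):

      #include <emmintrin.h>

      /* y[i] = a*x[i] + y[i] */
      void daxpy_sse2(int n, double a, const double *x, double *y) {
          __m128d va = _mm_set1_pd(a);              /* broadcast a to both lanes */
          for (int i = 0; i < n; i += 4) {
              __m128d x0 = _mm_loadu_pd(x + i);     /* X[i],   X[i+1] */
              __m128d x1 = _mm_loadu_pd(x + i + 2); /* X[i+2], X[i+3] */
              __m128d y0 = _mm_loadu_pd(y + i);
              __m128d y1 = _mm_loadu_pd(y + i + 2);
              y0 = _mm_add_pd(_mm_mul_pd(va, x0), y0);
              y1 = _mm_add_pd(_mm_mul_pd(va, x1), y1);
              _mm_storeu_pd(y + i, y0);
              _mm_storeu_pd(y + i + 2, y1);
          }
      }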

Challenges of SIMD Architectures
• Scalar processor memory architecture:
  – only accesses contiguous data
  – no efficient scatter/gather accesses
  – significant penalty for unaligned memory access
  – may need to write an entire vector register
• Limitations on data access patterns:
  – limited by the cache line, up to 128-256b
• Conditional execution:
  – register renaming does not work well with masked execution
  – always need to write the whole register
  – difficult to know when to indicate exceptions
• Register pressure:
  – need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity.
  – Ties together floating-point performance and memory performance for a target machine.
• Arithmetic intensity: floating-point operations per byte read.
• Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance).

Examples
• Attainable GFLOPs/sec for sample machines (roofline plots omitted).
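
A small C sketch of the roofline bound; the peak numbers below are illustrative placeholders, not measurements of any particular machine:

      #include <stdio.h>

      /* Attainable GFLOP/s is capped by either compute or memory traffic. */
      double attainable_gflops(double peak_gflops, double peak_bw_gbs, double ai) {
          double memory_roof = peak_bw_gbs * ai;  /* GB/s x FLOP/byte = GFLOP/s */
          return memory_roof < peak_gflops ? memory_roof : peak_gflops;
      }

      int main(void) {
          /* Assumed peaks: 100 GFLOP/s compute, 25 GB/s memory bandwidth. */
          for (double ai = 0.25; ai <= 16.0; ai *= 2)
              printf("AI=%5.2f -> %6.2f GFLOP/s\n",
                     ai, attainable_gflops(100.0, 25.0, ai));
          return 0;
      }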

History of GPUs
• Early video cards:
  – frame buffer memory with address generation for video output
• 3D graphics processing:
  – originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units:
  – processors oriented to 3D graphics tasks
  – vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model: the CPU is the host, the GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – Programming model is "Single Instruction, Multiple Thread" (SIMT).

Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
  – works well on data-level parallel problems
  – scatter-gather transfers from memory into local store
  – mask registers
  – large register files
• Differences:
  – no scalar processor; scalar integration
  – uses multithreading to hide memory latency
  – has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA programming model: Single Instruction, Multiple Thread (SIMT).
  – A thread is associated with each data element.
  – Threads are organized into blocks.
  – Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS.
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

Example
• Multiply two vectors of length 8192.
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes: 512 threads per block.
  – A SIMD instruction executes 32 elements at a time.
  – Thus the grid size is 8192/512 = 16 blocks.
  – A block is analogous to a strip-mined vector loop with a vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.
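
A minimal CUDA sketch of this example (the kernel and variable names are illustrative): 16 blocks of 512 threads give exactly 8192 CUDA threads, so every element is covered and no bounds check is needed under that assumption.

      __global__ void vec_mul(const double *a, const double *b, double *c) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one element per thread */
          c[i] = a[i] * b[i];
      }

      /* Host side: grid of 8192/512 = 16 blocks, 512 threads each. */
      /* vec_mul<<<8192/512, 512>>>(dev_a, dev_b, dev_c);           */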

Terminology
• Threads of SIMD instructions:
  – each has its own PC
  – the thread scheduler uses a scoreboard to dispatch
  – no data dependencies between threads
  – keeps track of up to 48 threads of SIMD instructions; hides memory latency
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes
  – wide and shallow compared to vector processors

Example
• An NVIDIA GPU has 32,768 registers.
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers (16 x 2048 = 32,768).

GTX570 GPU (memory hierarchy overview)
• Global memory: 1280MB; L2 cache: 640KB.
• Per SM (SM 0 ... SM 14, 32 cores each): 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache.
• Up to 1536 threads per SM.

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively: memory access optimization, latency hiding.

Matrix Multiplication
(slide figure omitted)

Programming the GPU
(slide figure omitted)

Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570: 15 SMs x 1536 threads), which means that many matrix cells can be calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the sketch below).
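
A CUDA sketch of this tiling scheme (a standard shared-memory matrix multiply, not the lecture's exact code; TILE, the kernel name, the square-matrix layout, and the assumption that n is a multiple of TILE with an (n/TILE) x (n/TILE) grid of TILE x TILE blocks are all illustrative):

      #define TILE 16  /* illustrative tile width */

      __global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
          __shared__ float As[TILE][TILE];  /* sub-matrix buffers in shared memory */
          __shared__ float Bs[TILE][TILE];
          int row = blockIdx.y * TILE + threadIdx.y;
          int col = blockIdx.x * TILE + threadIdx.x;
          float sum = 0.0f;
          for (int t = 0; t < n / TILE; t++) {
              /* each thread loads one element of each sub-matrix */
              As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
              Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
              __syncthreads();              /* wait until the tile is loaded */
              for (int k = 0; k < TILE; k++)
                  sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
              __syncthreads();              /* wait before overwriting the tile */
          }
          C[row * n + col] = sum;           /* one thread, one element of C */
      }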

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared in them are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example
• Invoke DAXPY:

      daxpy(n, 2.0, x, y);

• DAXPY in C:

      void daxpy(int n, double a, double *x, double *y) {
          for (int i = 0; i < n; i++)
              y[i] = a*x[i] + y[i];
      }

• Invoke DAXPY with 256 threads per thread block:

      __host__
      int nblocks = (n + 255) / 256;
      daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

• DAXPY in CUDA:

      __global__
      void daxpy(int n, double a, double *x, double *y) {
          int i = blockIdx.x*blockDim.x + threadIdx.x;
          if (i < n)
              y[i] = a*x[i] + y[i];
      }

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example (one DAXPY iteration):

      shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512, or 2^9)
      add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
      ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
      ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
      mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
      add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
      st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – a branch synchronization stack
    • entries consist of masks for each SIMD lane, i.e., which threads commit their results (all threads execute)
  – instruction markers to manage when a branch diverges into multiple execution paths
    • push on a divergent branch
  – ...and when paths converge
    • act as barriers; pops the stack
• Per-thread-lane 1-bit predicate registers, specified by the programmer.

Example

      if (X[i] != 0)
          X[i] = X[i] - Y[i];
      else
          X[i] = Z[i];

            ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
            setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
            @!P1, bra ELSE1, *Push      ; Push old mask, set new mask bits
                                        ; if P1 false, go to ELSE1
            ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
            sub.f64       RD0, RD0, RD2 ; Difference in RD0
            st.global.f64 [X+R8], RD0   ; X[i] = RD0
            @P1, bra ENDIF1, *Comp      ; complement mask bits
                                        ; if P1 true, go to ENDIF1
      ELSE1:
            ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
            st.global.f64 [X+R8], RD0   ; X[i] = RD0
      ENDIF1:
            <next instruction>, *Pop    ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM.
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables.
• Each multithreaded SIMD processor also has local memory.
  – Shared by the SIMD lanes / threads within a block.
• Memory shared by all SIMD processors is GPU Memory.
  – The host can read and write GPU memory.

Fermi Architecture Innovations
• Each SIMD processor has:
  – two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision.
• Caches for GPU memory.
• 64-bit addressing and a unified address space.
• Error correcting codes.
• Faster context switching.
• Faster atomic instructions.

Fermi Multithreaded SIMD Proc.
(slide figure omitted)

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence:
  – the analysis focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires that there be no loop-carried dependence.
• Example 1:

      for (i=999; i>=0; i=i-1)
          x[i] = x[i] + s;

• No loop-carried dependence.

Example 2 for Loop-Level Parallelism
• Example 2:

      for (i=0; i<100; i=i+1) {
          A[i+1] = A[i] + C[i];    /* S1 */
          B[i+1] = B[i] + A[i+1];  /* S2 */
      }

• S1 uses a value computed by S1 in the previous iteration (A[i]); likewise, S2 uses B[i] computed by S2 in the previous iteration.
  – Loop-carried dependences.
• S2 also uses the value A[i+1] computed by S1 in the same iteration.
  – Not a loop-carried dependence.

Remarks
• Intra-loop dependence is not loop-carried dependence.
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence.
• Two types of S1-S2 intra-loop dependence:
  – circular: S1 depends on S2 and S2 depends on S1
  – not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences.
  – The absence of a cycle means that the dependences give a partial ordering on the statements.

Example 3 for Loop-Level Parallelism (1)
• Example 3:

      for (i=0; i<100; i=i+1) {
          A[i] = A[i] + B[i];      /* S1 */
          B[i+1] = C[i] + D[i];    /* S2 */
      }

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.

Example 3 for Loop-Level Parallelism (2)
• Transform to:

      A[0] = A[0] + B[0];
      for (i=0; i<99; i=i+1) {
          B[i+1] = C[i] + D[i];      /* S2 */
          A[i+1] = A[i+1] + B[i+1];  /* S1 */
      }
      B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried.

More on Loop-Level Parallelism

      for (i=0; i<100; i=i+1) {
          A[i] = B[i] + C[i];
          D[i] = A[i] * E[i];
      }

• The second reference to A need not be translated to a load instruction.
  – The two references are the same; there is no intervening memory access to the same location.
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize.

Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:

      for (i=1; i<100; i=i+1)
          Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – some vector computers have special support for executing recurrences
  – it may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact" analysis) and eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – a one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – the index of a multi-dimensional array is affine if the index in each dimension is affine
  – non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.

Finding Dependencies: Example
• Assume:
  – an array element is loaded with index c*i + d and stored with index a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. there exist j, k such that m <= j <= n and m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time.
  – Dependence testing is expensive but decidable.
  – GCD (greatest common divisor) test: if a loop-carried dependence exists, then GCD(c,a) divides |d-b|.
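
A sketch of the GCD test in C (function names are illustrative; coefficients are assumed positive). Because the test is only a necessary condition, it can prove that no dependence exists but cannot confirm that one does:

      #include <stdio.h>
      #include <stdlib.h>

      int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

      /* Store index a*i + b, load index c*i + d.                      */
      /* Returns 0 if the GCD test proves no dependence is possible;   */
      /* returns 1 if a dependence *may* exist (test is inconclusive). */
      int gcd_test(int a, int b, int c, int d) {
          return abs(d - b) % gcd(c, a) == 0;
      }

      int main(void) {
          /* X[2*i+3] = X[2*i] * 5.0:  a=2, b=3, c=2, d=0 */
          printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0)); /* prints 0 */
          return 0;
      }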

Example

      for (i=0; i<100; i=i+1)
          X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0.
  2. GCD(a, c) = 2, |b-d| = 3.
  3. Since 2 does not divide 3, no dependence is possible.

Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but not necessary:
  – the GCD test does not consider the loop bounds
  – there are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete.

Finding Dependencies: Example 2
• Example 2:

      for (i=0; i<100; i=i+1) {
          Y[i] = X[i] / c;  /* S1 */
          X[i] = X[i] + c;  /* S2 */
          Z[i] = Y[i] + c;  /* S3 */
          Y[i] = c - Y[i];  /* S4 */
      }

• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried.
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i]).
• Output dependence: S1->S4 (Y[i]).

Renaming to Eliminate False (Pseudo) Dependences
• Before:

      for (i=0; i<100; i=i+1) {
          Y[i] = X[i] / c;
          X[i] = X[i] + c;
          Z[i] = Y[i] + c;
          Y[i] = c - Y[i];
      }

• After:

      for (i=0; i<100; i=i+1) {
          T[i]  = X[i] / c;
          X1[i] = X[i] + c;
          Z[i]  = T[i] + c;
          Y[i]  = c - T[i];
      }

Eliminating Recurrence Dependence
• Recurrence example: a dot product.

      for (i=9999; i>=0; i=i-1)
          sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence.
• Transform to:

      for (i=9999; i>=0; i=i-1)       /* scalar expansion:          */
          sum[i] = x[i] * y[i];       /* scalar ==> vector; parallel */

      for (i=9999; i>=0; i=i-1)       /* a reduction: sums up the    */
          finalsum = finalsum + sum[i]; /* vector elements; not parallel */

Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures.
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum 1000 elements on each of ten processors (p ranges from 0 to 9):

      for (i=999; i>=0; i=i-1)
          finalsum[p] = finalsum[p] + sum[i+1000*p];
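
Each processor produces one partial sum, so a final, much shorter reduction combines the ten partials on a single processor (a sketch following the slide's variable names; "total" is illustrative):

      /* After the parallel phase, combine the ten per-processor partials. */
      double total = 0.0;
      for (int p = 0; p < 10; p++)
          total = total + finalsum[p];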

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step in performance.
• Coarse-grained vs. fine-grained multithreading:
  – switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture.
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP.
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines.
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• Fundamental design issue is memory bandwidth.

Page 29: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Strip‐mining

ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do

for (i=0 iltN i++)C[i] = A[i]+B[i]

+

+

+

A B C

64 elements

Remainder

Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo

CA-Lec8 cwliutwinseenctuedutw 29

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 30: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Length Register

bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length

Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop

for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation

low = low + VL start of next vectorVL = MVL reset the length to maximum vector length

CA-Lec8 cwliutwinseenctuedutw 30

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride?
• Memory system must sustain (# lanes × word) per clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – Can read the paper on how to do a prime number of banks efficiently

Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, there is a conflict on word accesses:

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
      x[i][j] = 2 * x[i][j];

• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if there are 2^N words per bank

Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – # banks / GCD(stride, # banks) < bank busy time
    (equivalently, LCM(stride, # banks) / stride < bank busy time)
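
A sketch of that condition as code, assuming the GCD form above (function names are mine):

  #include <stdio.h>

  static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

  /* Conflict if successive accesses revisit a bank before it is ready. */
  static int conflicts(int banks, int stride, int busy) {
      return banks / gcd(stride, banks) < busy;
  }

  int main(void) {
      printf("stride 1:  %s\n", conflicts(8, 1, 6)  ? "stall" : "ok"); /* ok    */
      printf("stride 32: %s\n", conflicts(8, 32, 6) ? "stall" : "ok"); /* stall */
      return 0;
  }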

Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element

Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
  for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered)

Scatter-Gather
• Consider:
  for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
  LV      Vk, Rk        ; load K
  LVI     Va, (Ra+Vk)   ; load A[K[]]
  LV      Vm, Rm        ; load M
  LVI     Vc, (Rc+Vm)   ; load C[M[]]
  ADDVV.D Va, Va, Vc    ; add them
  SVI     (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
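
What LVI (gather) and SVI (scatter) do, written out in scalar C; a sketch for intuition, not the hardware semantics:

  #include <stdio.h>

  static void gather(double *dst, const double *base, const int *idx, int n) {
      for (int i = 0; i < n; i++) dst[i] = base[idx[i]];   /* LVI */
  }
  static void scatter(double *base, const double *src, const int *idx, int n) {
      for (int i = 0; i < n; i++) base[idx[i]] = src[i];   /* SVI */
  }

  int main(void) {
      double A[8] = {0}, Va[3], Vc[3] = {1, 2, 3};
      int K[3] = {5, 1, 6};
      gather(Va, A, K, 3);                        /* LVI Va,(Ra+Vk) */
      for (int i = 0; i < 3; i++) Va[i] += Vc[i]; /* ADDVV.D        */
      scatter(A, Va, K, 3);                       /* SVI (Ra+Vk),Va */
      printf("A[5]=%g A[1]=%g A[6]=%g\n", A[5], A[1], A[6]);
      return 0;
  }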

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes allow a slower clock rate
  – Scalable if elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• [Table: Cray Y-MP benchmark vectorization results]

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers, aligned in memory
  – Vector: single instruction, 64 memory accesses; a page fault in the middle of the vector is likely!
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture

Example SIMD Code
• Example: DAXPY with 4-wide SIMD operations
  L.D      F0, a          ; load scalar a
  MOV      F1, F0         ; copy a into F1 for SIMD MUL
  MOV      F2, F0         ; copy a into F2 for SIMD MUL
  MOV      F3, F0         ; copy a into F3 for SIMD MUL
  DADDIU   R4, Rx, #512   ; last address to load
Loop:
  L.4D     F4, 0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
  MUL.4D   F4, F4, F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
  L.4D     F8, 0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
  ADD.4D   F8, F8, F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
  S.4D     0[Ry], F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
  DADDIU   Rx, Rx, #32    ; increment index to X
  DADDIU   Ry, Ry, #32    ; increment index to Y
  DSUBU    R20, R4, Rx    ; compute bound
  BNEZ     R20, Loop      ; check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write an entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
• [Figure: roofline plots for example machines]
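
The roofline formula above is a one-liner in C; the machine parameters below are made up purely for illustration:

  #include <stdio.h>

  /* Roofline: attainable GFLOP/s = min(peak BW * intensity, peak FP). */
  static double attainable(double peak_gflops, double peak_bw_gbs, double ai) {
      double mem_bound = peak_bw_gbs * ai;
      return mem_bound < peak_gflops ? mem_bound : peak_gflops;
  }

  int main(void) {
      /* hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory bandwidth */
      for (double ai = 0.25; ai <= 16.0; ai *= 2)
          printf("AI=%5.2f -> %6.1f GFLOP/s\n", ai, attainable(100.0, 25.0, ai));
      return 0;
  }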

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example
• Multiply two vectors of length 8192 (grid-size arithmetic sketched below)
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with a vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
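
The grid-size arithmetic, as a sketch in C:

  #include <stdio.h>

  int main(void) {
      int n = 8192, threads_per_block = 512;
      /* round up so a partial block covers any leftover elements */
      int blocks = (n + threads_per_block - 1) / threads_per_block;
      printf("grid size = %d blocks\n", blocks);   /* 16 */
      return 0;
  }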

Terminology
• Threads of SIMD instructions:
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication
[Figure: matrix multiplication decomposition on the GPU]

Programming the GPU
[Figure: CUDA matrix multiplication source example]

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated in GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier within the grid
  – threadIdx: thread identifier within its block
  – blockDim: number of threads per block

CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
  shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
  add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
  ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
  ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
  mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • I.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

  ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
  setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push      ; Push old mask, set new mask bits
                              ; if P1 false, go to ELSE1
  ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
  sub.f64       RD0, RD0, RD2 ; Difference in RD0
  st.global.f64 [X+R8], RD0   ; X[i] = RD0
  @P1, bra ENDIF1, *Comp      ; complement mask bits
                              ; if P1 true, go to ENDIF1
ELSE1:
  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
  st.global.f64 [X+R8], RD0   ; X[i] = RD0
ENDIF1:
  <next instruction>, *Pop    ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor]

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence analysis
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:
  for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
  – Loop-carried dependence
• S2 also uses the value computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:
  for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
  }
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism
  for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
  }
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

Recurrence
• A recurrence is a special form of loop-carried dependence.
• Recurrence example:
  for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m <= j <= n and m <= k <= n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)

Example (a C sketch of the test follows below)
  for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
  }
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d-b| = 3
  3. Since 2 does not divide 3, no dependence is possible
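
A quick sketch of the GCD test in C (the helper names are mine, not from the slides):

  #include <stdio.h>

  static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

  /* GCD test: a dependence between X[a*i+b] (store) and X[c*i+d] (load)
   * is only possible when GCD(c, a) divides (d - b).                    */
  static int may_depend(int a, int b, int c, int d) {
      int diff = d - b;
      if (diff < 0) diff = -diff;
      return diff % gcd(c, a) == 0;
  }

  int main(void) {
      /* X[2i+3] = X[2i] * 5.0  ->  a=2, b=3, c=2, d=0 */
      printf("dependence possible? %s\n",
             may_depend(2, 3, 2, 0) ? "yes" : "no");   /* no */
      return 0;
  }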

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependencies: Example 2
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
  }
• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
  }

Eliminating Recurrence Dependence
• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
  for (i=9999; i>=0; i=i-1)        /* scalar expansion:            */
    sum[i] = x[i] * y[i];          /* scalar ==> vector; parallel  */
  for (i=9999; i>=0; i=i-1)        /* reduction: sums up the       */
    finalsum = finalsum + sum[i];  /* vector elements; not parallel */

Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
  for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

Page 31: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Maximum Vector Length (MVL)

bull Determine the maximum number of elements in a vector for a given architecture

bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor

bull Later generations may grow the MVLndash No need to change the ISA

CA-Lec8 cwliutwinseenctuedutw 31

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 32: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector‐Mask Control

CA-Lec8 cwliutwinseenctuedutw 32

C[4]

C[5]

C[1]

Write data port

A[7] B[7]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

Density-Time Implementationndash scan mask vector and only execute

elements with non-zero masks

C[1]

C[2]

C[0]

A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0

M[1]=1

M[0]=0

Write data portWrite Disable

A[7] B[7]M[7]=1

Simple Implementationndash execute all N operations turn off result

writeback according to mask

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banks

• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

CA-Lec8 [email protected] 36

Example (Cray T90)

• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
  1. The maximum number of memory references each cycle is 32 x 6 = 192
  2. An SRAM access takes 15/2.167 = 6.92, i.e., about 7 processor cycles
  3. It requires 192 x 7 = 1344 memory banks to sustain full memory bandwidth

CA-Lec8 [email protected] 37

The Cray T932 actually has 1024 memory banks, so it cannot sustain full bandwidth.

Memory Addressing

• Load/store operations move groups of data between registers and memory
• Three types of addressing (sketched in C below):
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize

CA-Lec8 [email protected] 38
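A small C sketch of the three access patterns (array and index names are illustrative):

    /* unit stride: consecutive words */
    for (i = 0; i < 64; i++) v[i] = a[i];

    /* non-unit (constant) stride: every s-th word */
    for (i = 0; i < 64; i++) v[i] = a[i*s];

    /* indexed (gather): addresses come from an index vector k */
    for (i = 0; i < 64; i++) v[i] = a[k[i]];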

Interleaved Memory Layout

• Great for unit stride:
  – Contiguous elements live in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4

(Figure: a vector processor connected to 8 unpipelined DRAM banks; word addresses are interleaved so that bank i holds the words with Addr Mod 8 = i, for i = 0..7.)

CA-Lec8 [email protected] 39

Handling Multidimensional Arrays in Vector Architectures

• Consider:

    for (i = 0; i < 100; i=i+1)
      for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
          A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }

• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements

CA-Lec8 [email protected] 40

(Unit/Non-Unit) Stride Addressing

• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure (see the column-access sketch below)
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help with non-unit-stride data

CA-Lec8 [email protected] 41
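For instance, loading column j of the 100x100 row-major matrix D amounts to a stride-100 access (d is a hypothetical flat view of D, used only for illustration):

    /* column j of row-major D: consecutive elements are 100 doubles apart */
    double *d = &D[0][0];
    for (k = 0; k < 100; k++)
        col[k] = d[k*100 + j];   /* stride = 100 double words */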

How to get full bandwidth for unit stride?

• Memory system must sustain (# lanes x word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – one can read a paper on how to implement a prime number of banks efficiently

CA-Lec8 [email protected] 42

Memory Bank Conflicts

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict:

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of banks
  – but that needs a modulo and a divide per memory access with a prime number of banks
  – trick: address within bank = address mod number of words in bank
  – this is easy if there are 2^N words per bank (just the low-order address bits)
  (both mappings are sketched in C below)

CA-Lec8 [email protected] 43
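A sketch of the two mappings under assumed parameters (7 banks, 1024 words per bank; the mod-words mapping gives distinct (bank, offset) pairs because 7 and 1024 are coprime, by the Chinese remainder theorem):

    #define NBANKS 7          /* prime number of banks */
    #define NWORDS 1024       /* 2^10 words per bank   */

    unsigned bank(unsigned addr)       { return addr % NBANKS; }
    unsigned offset_div(unsigned addr) { return addr / NBANKS; }       /* needs a real divide  */
    unsigned offset_mod(unsigned addr) { return addr & (NWORDS - 1); } /* low 10 bits only     */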

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – LCM(stride, # banks) / stride < bank busy time
    (equivalently, # banks / GCD(stride, # banks) < bank busy time; a check is sketched below)

CA-Lec8 [email protected] 44
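A small C check of that condition (a sketch; gcd is a plain Euclid helper):

    unsigned gcd(unsigned a, unsigned b)
    {
        while (b) { unsigned t = a % b; a = b; b = t; }
        return a;
    }

    /* 1 if consecutive strided accesses revisit a bank before it is ready */
    int bank_conflict(unsigned stride, unsigned banks, unsigned busy)
    {
        return banks / gcd(stride, banks) < busy;
    }

With the numbers from the next slide, bank_conflict(1, 8, 6) is 0 (8 >= 6, no stall) and bank_conflict(32, 8, 6) is 1 (8/8 = 1 < 6, stall on every access).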

Example

• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – the load will take 12 + 1 + 6 x 63 = 391 cycles, i.e., 6.1 cycles per element

CA-Lec8 [email protected] 45

Handling Sparse Matrices in Vector Architectures

• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

    for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered)

CA-Lec8 [email protected] 46

Scatter-Gather

• Consider:

    for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

    LV      Vk, Rk       ; load K
    LVI     Va, (Ra+Vk)  ; load A[K[]]  (gather)
    LV      Vm, Rm       ; load M
    LVI     Vc, (Rc+Vm)  ; load C[M[]]  (gather)
    ADDVV.D Va, Va, Vc   ; add them
    SVI     (Ra+Vk), Va  ; store A[K[]] (scatter)

• A and C must have the same number of non-zero elements (the sizes of K and M)

CA-Lec8 [email protected] 47

Vector Architecture Summary

• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than out-of-order machines
• More lanes, slower clock rate
  – Scalable if elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

CA-Lec8 [email protected] 48

Programming Vector Architectures

• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks (benchmark table lost in extraction)

CA-Lec8 [email protected] 49

Multimedia Extensions

• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

CA-Lec8 [email protected] 50

"Vector" for Multimedia?

• Intel MMX: 57 additional 80x86 instructions (1st extension since the 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
  – reuse the 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support

CA-Lec8 [email protected] 51

MMX Instructions

• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large

CA-Lec8 [email protected] 52

SIMD Implementations: IA32/AMD64

• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

CA-Lec8 [email protected] 53

SIMD Implementations: IBM

• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

CA-Lec8 [email protected] 54

Why SIMD Extensions?

• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures

CA-Lec8 [email protected] 55

Example SIMD Code

• Example: DAXPY

        L.D     F0,a          ; load scalar a
        MOV     F1,F0         ; copy a into F1 for SIMD MUL
        MOV     F2,F0         ; copy a into F2 for SIMD MUL
        MOV     F3,F0         ; copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512    ; last address to load
  Loop: L.4D    F4,0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0      ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D    F8,0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4      ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D    0[Ry],F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32     ; increment index to X
        DADDIU  Ry,Ry,#32     ; increment index to Y
        DSUBU   R20,R4,Rx     ; compute bound
        BNEZ    R20,Loop      ; check if done

CA-Lec8 [email protected] 56

Challenges of SIMD Architectures

• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

CA-Lec8 [email protected] 57

Roofline Performance Model

• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity (the governing formula is given below)
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

CA-Lec8 [email protected] 58
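The bound the model plots can be written in one line (this is the standard roofline formula, stated here for completeness):

    Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)

For low arithmetic intensity the memory-bandwidth term dominates (the sloped part of the roofline); past the machine's break-even intensity, peak floating-point performance is the flat ceiling.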

Examples

• Attainable GFLOPs/sec (roofline plots; figure lost in extraction)

CA-Lec8 [email protected] 59

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

CA-Lec8 [email protected] 60

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

CA-Lec8 [email protected] 61

Programming Model

• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 [email protected] 62

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 [email protected] 63

Programming the GPU

• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

CA-Lec8 [email protected] 64

Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks (see the arithmetic below)
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

CA-Lec8 [email protected] 65
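The block/grid arithmetic from this slide, as a few lines of C (the numbers are the slide's; the rounding-up idiom is the usual one):

    int n = 8192;
    int threads_per_block = 512;
    int grid_size = (n + threads_per_block - 1) / threads_per_block;  /* = 16 blocks */
    int simd_width = 32;                       /* one SIMD instruction covers 32 elements */
    int simd_threads_per_block = threads_per_block / simd_width;      /* = 16 */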

Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 [email protected] 66

Example

• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 [email protected] 67

GTX570 GPU

(Figure: GTX570 memory hierarchy: 1280MB global memory; 8KB texture cache; 8KB constant cache; 640KB L2 cache; per SM: 16KB L1 cache, 48KB shared memory, 32,768 registers; 15 streaming multiprocessors (SM 0 .. SM 14) with 32 cores each; up to 1536 threads per SM.)

CA-Lec8 [email protected] 68

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 [email protected] 69

Matrix Multiplication

(Figure: decomposition of the matrix-multiplication example; lost in extraction.)

CA-Lec8 [email protected] 70

Programming the GPU

(Figure: GPU source code for the matrix-multiplication example; lost in extraction.)

CA-Lec8 [email protected] 71

Matrix Multiplication

• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory, to act as a buffer and take advantage of GPU bandwidth (a kernel sketch follows)

CA-Lec8 [email protected] 72
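A minimal (untiled) CUDA kernel in the spirit of this slide: one thread per cell of C. The tiling into shared-memory sub-matrices that the last bullet describes is omitted for brevity; matmul and its parameters are illustrative names, not the lecture's actual code.

    __global__ void matmul(const float *A, const float *B, float *C, int n)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)             /* dot product of a row of A */
                sum += A[row*n + k] * B[k*n + col]; /* with a column of B        */
            C[row*n + col] = sum;                   /* one thread, one cell of C */
        }
    }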

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread-within-block identifier
  – blockDim: threads per block

CA-Lec8 [email protected] 73

CUDA Program Example

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);
    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
    // DAXPY in CUDA (__global__ marks the kernel entry point)
    __global__
    void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

CA-Lec8 [email protected] 74

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 [email protected] 75

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer

CA-Lec8 [email protected] 76

Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

            ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
            setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
      @!P1, bra ELSE1, *Push              ; Push old mask, set new mask bits
                                          ; if P1 false, go to ELSE1
            ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
            sub.f64        RD0, RD0, RD2  ; Difference in RD0
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
      @P1,  bra ENDIF1, *Comp             ; complement mask bits
                                          ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
            st.global.f64  [X+R8], RD0    ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop      ; pop to restore old mask

CA-Lec8 [email protected] 77

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – The host can read and write GPU memory

CA-Lec8 [email protected] 78

Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 [email protected] 79

Fermi Multithreaded SIMD Proc

(Figure: block diagram of the Fermi multithreaded SIMD processor; lost in extraction.)

CA-Lec8 [email protected] 80

Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism means no loop-carried dependence
• Example 1:

    for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

• No loop-carried dependence

CA-Lec8 [email protected] 81

Example 2 for Loop‐Level Parallelism

• Example 2:

    for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
    }

• S1 and S2 use values computed by S1 in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – Not a loop-carried dependence

CA-Lec8 [email protected] 82

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 [email protected] 83

Example 3 for Loop‐Level Parallelism (1)

• Example 3:

    for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
    }

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 [email protected] 84

Example 3 for Loop‐Level Parallelism (2)

• Transform to:

    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

CA-Lec8 [email protected] 85

More on Loop‐Level Parallelism

    for (i=0; i<100; i=i+1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction
  – The two references are the same: there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize

CA-Lec8 [email protected] 86

Recurrence

• A recurrence is a special form of loop-carried dependence
• Recurrence example:

    for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 [email protected] 87

Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 [email protected] 88

Finding Dependencies: Example

• Assume:
  – Store to an array element with index a*i + b, and load from the element with index c*i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m <= j <= n and m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) divides |d-b|
    (a sketch of the test follows)

CA-Lec8 [email protected] 89
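The test as a few lines of C (a sketch; may_depend reports "dependence possible", since the divisibility condition is only necessary, not sufficient):

    #include <stdlib.h>   /* abs */

    int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

    /* store index a*i + b, load index c*i + d */
    int may_depend(int a, int b, int c, int d)
    {
        return abs(d - b) % gcd(c, a) == 0;   /* does GCD(c,a) divide |d-b|? */
    }

For the example on the next slide, may_depend(2, 3, 2, 0) returns 0, ruling out a dependence.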

Example

    for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible

CA-Lec8 [email protected] 90

Remarks

• The GCD test is sufficient to rule out a dependence when it fails, but a passing test does not prove one exists
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 [email protected] 91

Finding Dependencies

• Example 2:

    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
    }

• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

CA-Lec8 [email protected] 92

Renaming to Eliminate False (Pseudo) Dependences

• Before:

    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
    }

• After:

    for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
    }

CA-Lec8 [email protected] 93

Eliminating Recurrence Dependence

• Recurrence example, a dot product:

    for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

    for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
    for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];

CA-Lec8 [email protected] 94

The first loop is called scalar expansion (scalar ===> vector) and is parallel.
The second loop is called a reduction: it sums up the elements of the vector and is not parallel.

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:

    for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9

CA-Lec8 [email protected] 95

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then it means simpler hardware, better energy efficiency, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 [email protected] 96

Page 33: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Mask Register (VMR)

bull A Boolean vector to control the execution of a vector instruction

bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly

bull For GPU it gets the same effect using hardwarendash Invisible to SW

bull Both GPU and vector processor spend time on masking

CA-Lec8 cwliutwinseenctuedutw 33

Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector

processorbull Consider

for (i = 0 i lt 64 i=i+1)if (X[i] = 0)

X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements

LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X

bull GFLOPS rate decreases

CA-Lec8 cwliutwinseenctuedutw 34

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures:
  – Similar to what can be done in a multiprocessor environment.

• Example: to sum up 1000 elements on each of ten processors:

for (i = 999; i >= 0; i = i - 1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];

  – Assume p ranges from 0 to 9; a sketch of the complete two-phase scheme follows below.
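A self-contained C sketch of the full two-level reduction (my own illustration; in real use the outer p loop would be distributed across the ten processors rather than run serially as here):

static double reduce_10000(const double sum[10000]) {
    double partial[10] = {0.0};                 /* per-processor private sums */
    /* Phase 1: each "processor" p sums its own 1000 elements (parallel). */
    for (int p = 0; p < 10; p = p + 1)
        for (int i = 999; i >= 0; i = i - 1)
            partial[p] = partial[p] + sum[i + 1000*p];
    /* Phase 2: combine the ten partials (serial here; a tree also works). */
    double finalsum = 0.0;
    for (int p = 0; p < 10; p = p + 1)
        finalsum = finalsum + partial[p];
    return finalsum;
}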


Multithreading and Vector Summary

• Explicitly parallel execution (DLP or TLP) is the next step up in performance.
• Coarse-grained vs. fine-grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading: fine-grained multithreading built on an OOO superscalar microarchitecture:
  – Instead of replicating registers, reuse the rename registers.

• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, then the hardware is simpler, more energy efficient, and a better real-time model than OOO machines.
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.



Page 35: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

CompressExpand Operationsbull Compress packs non‐masked elements from one vector register

contiguously at start of destination vector registerndash population count of mask vector gives packed vector length

bull Expand performs inverse operation

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

A[3]A[4]A[5]A[6]A[7]

A[0]A[1]A[2]

M[3]=0M[4]=1M[5]=1M[6]=0

M[2]=0M[1]=1M[0]=0

M[7]=1

B[3]A[4]A[5]B[6]A[7]

B[0]A[1]B[2]

Expand

A[7]

A[1]A[4]A[5]

Compress

A[7]

A[1]A[4]A[5]

Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35

Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first

word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector

bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit

bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses

to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each

processor will be generating its own independent stream of addresses

CA-Lec8 cwliutwinseenctuedutw 36

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists

• Determining whether a dependence actually exists is NP-complete

Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1→S3 and S1→S4 (both on Y[i]), but not loop-carried
• Antidependences: S1→S2 (X[i]) and S3→S4 (Y[i])
• Output dependence: S1→S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:
for (i=0; i<100; i=i+1) {
    T[i]  = X[i] / c;
    X1[i] = X[i] + c;
    Z[i]  = T[i] + c;
    Y[i]  = c - T[i];
}

Eliminating Recurrence Dependence

• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

• The first loop is called scalar expansion (scalar ===> vector); it is parallel
• The second loop is called a reduction; it sums up the elements of the vector and is not parallel

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment

• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
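Put together, the two phases look like this; a serial C sketch in which the p loop stands in for the ten processors:

/* Two-phase reduction sketch: 10 partial sums (parallelizable across
 * processors), then a final non-parallel combine over 10 values. */
double reduce_in_two_phases(const double sum[10000])
{
    double finalsum[10] = {0.0};
    double total = 0.0;
    for (int p = 0; p < 10; p++)          /* each p could run on its own processor */
        for (int i = 999; i >= 0; i--)
            finalsum[p] += sum[i + 1000*p];
    for (int p = 0; p < 10; p++)          /* final reduction, not parallel */
        total += finalsum[p];
    return total;
}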

Multithreading and Vector Summary

• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth


Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
– Spread accesses across multiple banks:
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses

Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. One SRAM access takes 15/2.167 = 6.92, i.e., 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth

Note: the Cray T932 actually has only 1024 memory banks (it cannot sustain full bandwidth)
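The arithmetic, checked in a few lines of C (a sketch of the slide's calculation, nothing more):

#include <math.h>
#include <stdio.h>

int main(void) {
    int refs_per_cycle = 32 * (4 + 2);                 /* 192 references/cycle */
    int sram_in_cpu_cycles = (int)ceil(15.0 / 2.167);  /* 6.92 -> 7 cycles     */
    printf("banks = %d\n", refs_per_cycle * sram_in_cpu_cycles);  /* 1344     */
    return 0;
}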

Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– Unit stride
• Contiguous block of information in memory
• Fastest; always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize the memory system for all possible strides
• A prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases the number of programs that vectorize

Interleaved Memory Layout

• Great for unit stride:
– Contiguous elements are in different DRAMs
– Startup time for a vector operation is the latency of a single read

• What about non-unit stride?
– The layout above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4

[Figure: a vector processor fed by eight unpipelined DRAM banks; an address maps to the bank given by Addr mod 8 = 0, 1, ..., 7]

Handling Multidimensional Arrays in Vector Architectures

• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }

• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into the register, it acts as if it had logically adjacent elements

(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word

• Use non-unit stride for D
– To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– Blocking techniques help non-unit-stride data
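In row-major C the two strides fall out of the storage layout directly; a small sketch using the array names from the example above:

/* Row of B vs. column of D in row-major storage (sketch). */
double B[100][100], D[100][100];

double row_times_col(int i, int j)
{
    double acc = 0.0;
    for (int k = 0; k < 100; k++)
        /* B[i][k] walks memory with stride 1; D[k][j] with stride 100 */
        acc += B[i][k] * D[k][j];
    return acc;
}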

How to get full bandwidth for unit stride

• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– (can read a paper on how to do a prime number of banks efficiently)

Memory Bank Conflicts

• Even with 128 banks, since 512 is a multiple of 128, the loop below conflicts on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number? easy if 2^N words per bank

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
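Why the inner loop above keeps hitting one bank: a sketch of the address-to-bank mapping, assuming word interleaving across the 128 banks:

/* Which bank does x[i][j] hit with 128 word-interleaved banks? */
unsigned bank_of(unsigned i, unsigned j)
{
    unsigned word_addr = i * 512 + j;   /* word index of x[i][j]     */
    return word_addr % 128;             /* bank = address mod #banks */
}
/* Walking down a column steps word_addr by 512; 512 % 128 == 0,
 * so bank_of(0,j), bank_of(1,j), ... all land in the same bank. */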

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict

• Stall the other request to resolve a bank conflict

• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time

Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?

• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride:
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– the total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
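The same timing model in two lines, for checking other strides (a sketch under the slide's assumptions only):

/* 64-element load, 8 banks, busy time 6, startup latency 12 (slide's model). */
int cycles_stride_1  = 12 + 64;          /* banks cover the latency: 76 -> 1.2 cycles/element  */
int cycles_stride_32 = 12 + 1 + 6 * 63;  /* every access collides:  391 -> 6.1 cycles/element  */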

Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors designating the nonzero elements of A and C

• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed or gathered

Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk         ; load K
LVI     Va, (Ra+Vk)    ; load A[K[]]
LV      Vm, Rm         ; load M
LVI     Vc, (Rc+Vm)    ; load C[M[]]
ADDVV.D Va, Va, Vc     ; add them
SVI     (Ra+Vk), Va    ; store A[K[]]

• A and C must have the same number of non-zero elements (sizes of K and M)

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate
– Scalable if elements are independent
– If there is dependency:
• One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2×32b, 4×16b, or 8×8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
– reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– use in drivers or added to library routines; no compiler

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if the number is too large
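What "saturate" means here, spelled out for one 8-bit lane (a C sketch, not MMX syntax):

#include <stdint.h>

/* Saturating unsigned 8-bit add: clamp to 255 instead of wrapping. */
uint8_t add_u8_sat(uint8_t a, uint8_t b)
{
    unsigned s = (unsigned)a + (unsigned)b;
    return (uint8_t)(s > 255 ? 255 : s);
}
/* An MMX PADDUSB performs this on eight such lanes at once. */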

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations

SIMD Implementations: IBM
• VMX (1996-1998)
– 32×4b, 16×8b, 8×16b, 4×32b integer ops and 4×32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16×8b, 8×16b, 4×32b integer ops, and 4×32b and 8×64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP

Why SIMD Extensions

• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers, aligned in memory
– Vector: a single instruction makes 64 memory accesses; a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture

Example SIMD Code
• Example: DAXPY
        L.D     F0,a          ;load scalar a
        MOV     F1,F0         ;copy a into F1 for SIMD MUL
        MOV     F2,F0         ;copy a into F2 for SIMD MUL
        MOV     F3,F0         ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512    ;last address to load
Loop:   L.4D    F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
        L.4D    F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
        S.4D    0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32     ;increment index to X
        DADDIU  Ry,Ry,#32     ;increment index to Y
        DSUBU   R20,R4,Rx     ;compute bound
        BNEZ    R20,Loop      ;check if done
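For comparison, the same kernel with today's x86 AVX intrinsics; a sketch that assumes n is a multiple of 4:

#include <immintrin.h>

/* DAXPY, 4 doubles per iteration with 256-bit AVX registers. */
void daxpy_avx(int n, double a, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(a);              /* broadcast a (like the MOVs above) */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);     /* X[i..i+3] */
        __m256d vy = _mm256_loadu_pd(y + i);     /* Y[i..i+3] */
        _mm256_storeu_pd(y + i, _mm256_add_pd(_mm256_mul_pd(va, vx), vy));
    }
}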

Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency

Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity:
– Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec

[Figure: roofline plots for example machines; the diagonal segment is the memory-bandwidth bound, the flat segment the peak floating-point bound]
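The quantity plotted here is the standard roofline bound from the textbook this lecture follows:

Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)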

History of GPUs

• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units

• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"

Programming Model

• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism, i.e., how to craft efficient parallel algorithms, rather than grappling with the mechanics of an unfamiliar and complicated language
– minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture

• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor, scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
– Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example

• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– A SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– A block is analogous to a strip-mined vector loop with vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
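In CUDA source, this example looks roughly like the sketch below; the kernel name is illustrative, and the launch shape follows the numbers above:

__global__ void vmul(const double *x, const double *y, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        out[i] = x[i] * y[i];
}

// 8192 elements / 512 threads per block = 16 blocks in the grid:
// vmul<<<16, 512>>>(x, y, out, 8192);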

Terminology

• Threads of SIMD instructions
– Each has its own PC
– The thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors

Example

• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU

[Figure: GTX570 memory hierarchy]
– Global memory: 1280 MB; L2 cache: 640 KB
– 15 SMs (SM 0 ... SM 14), each with 32 cores, 32768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, and 8 KB constant cache
– Up to 1536 threads/SM

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
• Memory access optimization, latency hiding

Matrix Multiplication

[Figure: GPU matrix-multiplication thread layout; not captured in this extraction]

Programming the GPU

[Figure: CUDA example code; not captured in this extraction]

Matrix Multiplication
• For a 4096×4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
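A compact sketch of the scheme just described: one thread per element of C, with sub-matrices (tiles) staged in shared memory. The kernel name, TILE size, and the N % TILE == 0 assumption are all illustrative:

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged in shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in shared memory
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;                  // this thread's element of C
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // whole tile loaded before use
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done reading this tile
    }
    C[row * N + col] = acc;
}

// Launch for N = 4096:
// dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
// matmul_tiled<<<grid, block>>>(A, B, C, N);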

Programming the GPU

• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block

CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<<...>>> must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (DAXPY):
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
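The effect of masked execution, written as scalar C; an if-conversion sketch of the example on the next slide, where both paths are evaluated and the predicate selects what is committed:

/* Per-lane view of a divergent branch under masking. */
void masked_update(double *X, const double *Y, const double *Z, int n)
{
    for (int i = 0; i < n; i++) {
        int p = (X[i] != 0);           /* per-lane predicate      */
        double then_v = X[i] - Y[i];   /* "then" path result      */
        double else_v = Z[i];          /* "else" path result      */
        X[i] = p ? then_v : else_v;    /* mask selects the commit */
    }
}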

Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]     ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0     ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]     ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2   ; Difference in RD0
        st.global.f64 [X+R8], RD0     ; X[i] = RD0
        @P1,  bra ENDIF1, *Comp       ; complement mask bits
                                      ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]     ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0     ; X[i] = RD0
ENDIF1: <next instruction>, *Pop      ; pop to restore old mask

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc

[Figure: block diagram of a Fermi multithreaded SIMD processor; not captured in this extraction]

Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence

• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;

• No loop-carried dependence

Example 2 for Loop‐Level Parallelism

• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed by S1 in the previous iteration
– Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
– No loop-carried dependence

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 37: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per

cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed

bull Solution1 The maximum number of memory references each cycle is

326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory

bandwidth

CA-Lec8 cwliutwinseenctuedutw 37

Cray T932 actually has 1024 memory banks (not sustain full bandwidth)

Memory Addressingbull Loadstore operations move groups of data between

registers and memorybull Three types of addressing

ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this

ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth

ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize

CA-Lec8 cwliutwinseenctuedutw 38

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
/* scalar expansion: scalar ===> vector; parallel */

for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
/* reduction: sums up the elements of the vector; not parallel */
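The reduction loop itself can also be restructured for partial parallelism. A sketch of ours (assuming sum[] already holds the 10,000 products): each folding pass performs mutually independent adds, so a vector or SIMD unit can execute a whole pass in parallel.

int n = 10000;                            /* number of partial sums in sum[] */
while (n > 1) {
  int half = n / 2;
  for (int i = 0; i < half; i = i + 1)    /* all adds in one pass are independent */
    sum[i] = sum[i] + sum[n - 1 - i];
  n = n - half;                           /* ceil(n/2) partial sums remain */
}
/* finalsum = sum[0]; note this reassociates the floating-point adds */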

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors (the full two-level version is sketched below):

for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];

– Assume p ranges from 0 to 9

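Putting the pieces together, a sequential sketch of ours (in a real system each value of p would run on its own processor, and only the final combine is serial):

double finalsum[10] = {0.0};             /* one partial sum per processor */
for (int p = 0; p < 10; p = p + 1)       /* level 1: ten independent loops */
  for (int i = 999; i >= 0; i = i - 1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];

double total = 0.0;
for (int p = 0; p < 10; p = p + 1)       /* level 2: combine the ten partials */
  total = total + finalsum[p];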

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then the hardware is simpler, more energy-efficient, and offers a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth


Page 39: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Interleaved Memory Layout

bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read

bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4

Vector Processor

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

UnpipelinedD

RAM

AddrMod 8= 0

AddrMod 8= 1

AddrMod 8= 2

AddrMod 8= 4

AddrMod 8= 5

AddrMod 8= 3

AddrMod 8= 6

AddrMod 8= 7

CA-Lec8 cwliutwinseenctuedutw 39

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 40: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Handling Multidimensional Arrays in Vector Architectures

bull Considerfor (i = 0 i lt 100 i=i+1)

for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]

bull Must vectorize multiplications of rows of B with columns of D

ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in

column‐major orderbull Once vector is loaded into the register it acts as if it has logically

adjacent elements

CA-Lec8 cwliutwinseenctuedutw 40

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

• Memory system must sustain (# lanes × word)/clock
• # memory banks > memory latency to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (can read a paper on how to implement a prime number of banks efficiently)

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

• Even with 128 banks, since 512 is a multiple of 128, there is a conflict on word accesses:

    int x[256][512];
    for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• SW: loop interchange, or declare the array not a power of 2 ("array padding")
• HW: prime number of banks (a sketch of both mappings follows below)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank

CA-Lec8 cwliutwinseenctuedutw 43
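A hedged sketch of the two bank-mapping schemes listed above (function names are mine; the prime scheme relies on the bank count and the words-per-bank being coprime):

    #include <stdio.h>

    /* Power-of-2 interleaved banking: bank and in-bank offset are simple bit fields. */
    unsigned bank_pow2(unsigned addr, unsigned nbanks)      { return addr % nbanks; } /* low bits  */
    unsigned offset_pow2(unsigned addr, unsigned nbanks)    { return addr / nbanks; } /* high bits */

    /* Prime-banking trick: bank = addr mod P, offset = addr mod words_per_bank
       (valid when P and words_per_bank are coprime, e.g. words_per_bank = 2^N). */
    unsigned bank_prime(unsigned addr, unsigned primebanks) { return addr % primebanks; }
    unsigned offset_prime(unsigned addr, unsigned wpb)      { return addr % wpb; }

    int main(void) {
        unsigned addr = 1000;
        printf("pow2:  bank %u, offset %u\n", bank_pow2(addr, 128), offset_pow2(addr, 128));
        printf("prime: bank %u, offset %u\n", bank_prime(addr, 127), offset_prime(addr, 512));
        return 0;
    }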

Problems of Stride Addressing

• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall one of the requests to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for an initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?

• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – Every access to memory (after the first one) will collide with the previous access and will have to wait out the 6-cycle bank busy time
     – The total time will be 12 + 1 + 6*63 = 391 cycles, i.e., 6.1 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45
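A quick check of the arithmetic above, under the simplified model the example assumes (fully pipelined banks for stride 1; for stride 32, every access after the first waits out the bank busy time):

    #include <stdio.h>

    int main(void) {
        int latency = 12, busy = 6, n = 64;
        int stride1  = latency + n;              /* 12 + 64 = 76 cycles   */
        int stride32 = latency + 1 + busy*(n-1); /* 12 + 1 + 6*63 = 391   */
        printf("stride 1:  %d cycles (%.1f per element)\n", stride1,  (double)stride1 / n);
        printf("stride 32: %d cycles (%.1f per element)\n", stride32, (double)stride32 / n);
        return 0;
    }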

Handling Sparse Matrices in Vector Architecture
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

    for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gather/scatter)

CA-Lec8 cwliutwinseenctuedutw 46

Scatter-Gather
• Consider:

    for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

    LV      Vk, Rk        ; load K
    LVI     Va, (Ra+Vk)   ; load A[K[]]
    LV      Vm, Rm        ; load M
    LVI     Vc, (Rc+Vm)   ; load C[M[]]
    ADDVV.D Va, Va, Vc    ; add them
    SVI     (Ra+Vk), Va   ; store A[K[]]

• A and C must have the same number of non-zero elements (the sizes of K and M)

CA-Lec8 cwliutwinseenctuedutw 47
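In plain C, the gather/scatter sequence above is equivalent to the following sketch (the dense temporaries va and vc are my names, standing in for vector registers):

    /* What LVI / ADDVV.D / SVI accomplish, expressed as scalar C code. */
    void sparse_add(double *A, const double *C,
                    const int *K, const int *M,
                    double *va, double *vc, int n) {
        for (int i = 0; i < n; i++) va[i] = A[K[i]];   /* gather A[K[]]    */
        for (int i = 0; i < n; i++) vc[i] = C[M[i]];   /* gather C[M[]]    */
        for (int i = 0; i < n; i++) va[i] += vc[i];    /* dense vector add */
        for (int i = 0; i < n; i++) A[K[i]] = va[i];   /* scatter back     */
    }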

Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (AltiVec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large

CA-Lec8 cwliutwinseenctuedutw 52
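To illustrate saturating parallel arithmetic, a scalar C sketch of an MMX-style unsigned saturating byte add (one instruction's worth of work, with the 8 lanes written out as a loop):

    #include <stdint.h>

    /* Lane-wise unsigned saturating add on 8 packed bytes,
       i.e., the effect of an MMX-style "add unsigned saturate". */
    void padd_usb8(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8]) {
        for (int i = 0; i < 8; i++) {
            unsigned s = (unsigned)a[i] + b[i];
            dst[i] = (s > 255) ? 255 : (uint8_t)s;  /* clamp to max on overflow */
        }
    }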

SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses; a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Code
• Example: DXPY

          LD     F0, a        ; load scalar a
          MOV    F1, F0       ; copy a into F1 for SIMD MUL
          MOV    F2, F0       ; copy a into F2 for SIMD MUL
          MOV    F3, F0       ; copy a into F3 for SIMD MUL
          DADDIU R4, Rx, #512 ; last address to load
    Loop: L.4D   F4, 0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D F4, F4, F0   ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
          L.4D   F8, 0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D F8, F8, F4   ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
          S.4D   0[Ry], F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU Rx, Rx, #32  ; increment index to X
          DADDIU Ry, Ry, #32  ; increment index to Y
          DSUBU  R20, R4, Rx  ; compute bound
          BNEZ   R20, Loop    ; check if done

CA-Lec8 cwliutwinseenctuedutw 56
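For comparison, a hedged sketch of the same DXPY kernel written with SSE2 compiler intrinsics (two doubles per 128-bit register; assumes n is even and unaligned loads/stores are acceptable):

    #include <emmintrin.h>

    /* y[i] = a*x[i] + y[i], two elements per iteration via SSE2. */
    void dxpy_sse2(int n, double a, const double *x, double *y) {
        __m128d va = _mm_set1_pd(a);               /* broadcast a into both lanes */
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_loadu_pd(&x[i]);      /* load x[i], x[i+1] */
            __m128d vy = _mm_loadu_pd(&y[i]);      /* load y[i], y[i+1] */
            vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
            _mm_storeu_pd(&y[i], vy);              /* store y[i], y[i+1] */
        }
    }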

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58
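The roofline itself is just the minimum of two ceilings; the "Attainable GFLOPs/sec" curves on the next slide follow this standard formula (a one-line sketch):

    #include <math.h>

    /* Attainable GFLOP/s = min(peak memory BW * arithmetic intensity, peak FP performance) */
    double roofline(double peak_gflops, double peak_gbytes_per_s, double ai_flops_per_byte) {
        return fmin(peak_gbytes_per_s * ai_flops_per_byte, peak_gflops);
    }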

Examples

• Attainable GFLOPs/sec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

CA-Lec8 cwliutwinseenctuedutw 61


Programming Model

• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63


Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?

CA-Lec8 cwliutwinseenctuedutw 64


Example

• Multiply two vectors of length 8192 (a CUDA sketch follows below)
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65
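A hedged CUDA sketch of this example (the kernel name and launch are mine): 8192 elements with 512 threads per block gives a grid of 16 blocks:

    __global__ void vecmul(const double *a, const double *b, double *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
        c[i] = a[i] * b[i];
    }

    /* Host-side launch: 8192 / 512 = 16 blocks of 512 threads each. */
    /* vecmul<<<16, 512>>>(a, b, c); */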


Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example

• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU (block diagram)
• Global Memory: 1280MB; L2 Cache: 640KB; Texture Cache: 8KB; Constant Cache: 8KB
• 15 SMs (SM 0 ... SM 14), each with: 32 cores, 32,768 registers, Shared Memory 48KB, L1 Cache 16KB
• Up to 1536 threads per SM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively: memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72
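A hedged sketch of the one-thread-per-cell kernel described above, in its naive form without the shared-memory tiling (all names are mine):

    /* C = A * B for n x n row-major matrices; one thread computes one C cell. */
    __global__ void matmul(const float *A, const float *B, float *C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += A[row * n + k] * B[k * n + col];  /* dot product of row and column */
            C[row * n + col] = sum;
        }
    }

The tiled version the slide describes would first stage block-sized sub-matrices of A and B in __shared__ arrays and synchronize with __syncthreads() between tiles, cutting global-memory traffic.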

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU Device; variables declared there are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);

    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y) {
      for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
    }

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    // DAXPY in CUDA (kernels launched with <<<...>>> must be __global__)
    __global__
    void daxpy(int n, double a, double *x, double *y) {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
        y[i] = a*x[i] + y[i];
    }

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

    shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
    add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pop stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example

    if (X[i] != 0)
      X[i] = X[i] - Y[i];
    else
      X[i] = Z[i];

        ld.global.f64   RD0, [X+R8]   ; RD0 = X[i]
        setp.neq.s32    P1, RD0, #0   ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64   RD2, [Y+R8]   ; RD2 = Y[i]
        sub.f64         RD0, RD0, RD2 ; Difference in RD0
        st.global.f64   [X+R8], RD0   ; X[i] = RD0
        @P1, bra ENDIF1, *Comp        ; complement mask bits
                                      ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64   [X+R8], RD0   ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop  ; pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:

    for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop‐Level Parallelism

• Example 2:

    for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
    }

• S1 and S2 each use a value computed by themselves in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

• Example 3:

    for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
    }

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

• Transform to:

    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

    for (i=0; i<100; i=i+1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

    for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding Dependences: Example
• Assume:
  – Load an array element with index c × i + d and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d−b), i.e., GCD(c,a) | |d−b|

CA-Lec8 cwliutwinseenctuedutw 89

Example

    for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c)=2, |b−d|=3
  3. Since 2 does not divide 3, no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90
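The test is mechanical enough to sketch directly (a hedged C version of the steps above):

    #include <stdio.h>

    int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    /* GCD test: a dependence between X[a*i+b] (store) and X[c*i+d] (load)
       is possible only if GCD(c,a) divides (d-b). */
    int dependence_possible(int a, int b, int c, int d) {
        int g = gcd(c, a), diff = d - b;
        if (diff < 0) diff = -diff;
        return diff % g == 0;
    }

    int main(void) {
        /* X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("%s\n", dependence_possible(2, 3, 2, 0) ? "possible" : "no dependence");
        return 0;
    }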

Remarks

• The GCD test is sufficient to guarantee that no dependence exists when it fails, but it is not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (a dependence looks possible) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding Dependences
• Example 2:

    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
    }

• True dependences: S1->S3 and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences

• Before:

    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
    }

• After:

    for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;   /* Y renamed to T to remove the output dependence */
      X1[i] = X[i] + c;  /* X renamed to X1 to remove the antidependence   */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
    }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example: a dot product

    for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to...

    for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];

    for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];

CA-Lec8 cwliutwinseenctuedutw 94

(The first loop is called scalar expansion: the scalar sum becomes a vector, and the loop is vectorizable/parallel. The second loop is called a reduction: it sums up the elements of the vector and is not parallel.)

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (sketched below):

    for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95
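A hedged C sketch of the full two-phase scheme: the per-processor partial sums above (the outer p loop standing in for the ten processors running in parallel), followed by a short serial combine:

    #define N 10000
    #define P 10

    double reduce_sum(const double sum[N]) {
        double finalsum[P] = {0.0};
        /* Phase 1: each processor p sums its own 1000-element slice (parallel). */
        for (int p = 0; p < P; p++)
            for (int i = 999; i >= 0; i--)
                finalsum[p] = finalsum[p] + sum[i + 1000*p];
        /* Phase 2: combine the P partial sums (serial, only P-1 additions). */
        double total = 0.0;
        for (int p = 0; p < P; p++)
            total += finalsum[p];
        return total;
    }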

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step up in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 41: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single

register is called stridebull In the example

ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word

bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a

dense structurebull The size of the matrix may not be known at compile time

ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐

purpose register (dynamic)bull Cache inherently deals with unit stride data

ndash Blocking techniques helps non‐unit stride data

CA-Lec8 cwliutwinseenctuedutw 41

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 42: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

How to get full bandwidth for unit stride

bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle

ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types

bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full

bandwidthndash can read paper on how to do prime number of banks efficiently

CA-Lec8 cwliutwinseenctuedutw 42

Memory Bank Conflicts

bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks

ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank

CA-Lec8 cwliutwinseenctuedutw 43

int x[256][512]for (j = 0 j lt 512 j = j+1)

for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Code
• Example: DAXPY

      L.D     F0,a          ;load scalar a
      MOV     F1,F0         ;copy a into F1 for SIMD MUL
      MOV     F2,F0         ;copy a into F2 for SIMD MUL
      MOV     F3,F0         ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512    ;last address to load
Loop: L.4D    F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D    F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D    0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32     ;increment index to X
      DADDIU  Ry,Ry,#32     ;increment index to Y
      DSUBU   R20,R4,Rx     ;compute bound
      BNEZ    R20,Loop      ;check if done

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers rather than register depth to hide latency

Roofline Performance Model
• Basic idea
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
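
Both "roofs" combine into a single bound: at a given arithmetic intensity, attainable performance is the lower of the memory roof and the compute roof:

    Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

Kernels with low arithmetic intensity sit under the slanted memory roof; compute-bound kernels sit under the flat floating-point roof.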

Examples
• Attainable GFLOPs/sec for sample machines (roofline plots; figures not reproduced)

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

Graphical Processing Units
• Basic idea
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

Programming Model
• CUDA's design goals
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

NVIDIA GPU Architecture
• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

GTX570 GPU (organization; from the block diagram)
• Global Memory: 1280MB
• L2 Cache: 640KB
• 15 SMs (SM 0 ... SM 14), each with:
  – 32 cores
  – Registers: 32,768
  – Shared Memory: 48KB
  – L1 Cache: 16KB
  – Texture Cache: 8KB
  – Constant Cache: 8KB
• Up to 1536 Threads/SM

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

Matrix Multiplication
• [figure: GPU matrix multiplication example; not reproduced]

Programming the GPU
• [figure: CUDA source listing for the example; not reproduced]

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the kernel sketch below)
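
A minimal CUDA sketch of the scheme described above, standing in for the listings on the figure-only slides: one thread per element of C, with TILE×TILE sub-matrices staged in shared memory. The kernel name, TILE value, and row-major layout are illustrative assumptions, and n is assumed divisible by TILE (4096 is):

#define TILE 16

__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   // sub-matrix of A for this block
    __shared__ float Bs[TILE][TILE];   // sub-matrix of B for this block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                    // tiles fully loaded

        for (int k = 0; k < TILE; ++k)      // partial dot product from the tile
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    // done with this tile
    }
    C[row * n + col] = sum;                 // one element of C per thread
}

// Hypothetical launch: dim3 block(TILE, TILE); dim3 grid(n/TILE, n/TILE);
// matmul<<<grid, block>>>(A, B, C, 4096);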

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU (device); variables declared there are allocated to the GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
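
For context, a hypothetical host-side wrapper around the kernel above (not on the original slide); note that in current CUDA a kernel launched with <<<...>>> must be declared __global__ rather than __device__:

#include <cuda_runtime.h>

void daxpy_on_gpu(int n, double a, double *x, double *y)
{
    double *d_x, *d_y;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&d_x, bytes);                            // device copies
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);
    int nblocks = (n + 255) / 256;                      // 256 threads/block
    daxpy<<<nblocks, 256>>>(n, a, d_x, d_y);            // kernel from the slide
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);  // result back to host
    cudaFree(d_x);
    cudaFree(d_y);
}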

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0        ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64        RD0, RD0, RD2      ; Difference in RD0
st.global.f64  [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
st.global.f64  [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask

NVIDIA GPU Memory Structures
• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc.
• [figure: block diagram of a Fermi multithreaded SIMD processor; not reproduced]

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence

Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by the same statement in the previous iteration
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction (see the sketch below)
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
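
A sketch of what the optimized loop can look like once the analysis proves the reuse is safe (the temporary t is our own illustrative name):

for (int i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];  // computed once, kept in a register
    A[i] = t;                // single store to A[i]
    D[i] = t * E[i];         // reuse the register copy; no reload of A[i]
}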

Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences
• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example
• Assume:
  – Load an array element with index c*i + d and store to a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. Given j, k such that m <= j <= n, m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible
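
A minimal C sketch of the test applied to this example (helper names are ours; assumes the coefficients a and c are nonzero):

#include <stdio.h>
#include <stdlib.h>

/* Greatest common divisor, Euclid's algorithm */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for a store to X[a*i + b] and a load from X[c*i + d]:
 * a loop-carried dependence is only possible if GCD(c,a) divides (d-b).
 * Returns 1 if a dependence is possible, 0 if it is ruled out. */
static int gcd_test(int a, int b, int c, int d) {
    int g = gcd(abs(c), abs(a));
    return abs(d - b) % g == 0;
}

int main(void) {
    /* Slide example: X[2i+3] = X[2i] * 5.0  =>  a=2, b=3, c=2, d=0 */
    printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));  /* prints 0 */
    return 0;
}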

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependencies: Example 2

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
• The first loop is called scalar expansion (scalar ===> vector); it is parallel
• The second loop is called a reduction (it sums up the elements of the vector); it is not parallel

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (the ten partial sums are then combined, as sketched below):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
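
A minimal sketch completing the example: once each processor p has produced its private finalsum[p], one processor (or a small reduction tree) combines the ten partials:

double total = 0.0;
for (int p = 0; p < 10; p++)    // sequential combine of the partial sums
    total += finalsum[p];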

Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth


Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 44: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Problems of Stride Addressing

bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently

ndash Memory bank conflict

bull Stall the other request to solve bank conflict

bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time

ndash banks LCM(stridebanks) lt bank busy time

CA-Lec8 cwliutwinseenctuedutw 44

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

Fermi Multithreaded SIMD Proc


Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

• No loop-carried dependence (see the vectorization sketch below)
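Because no iteration of Example 1 reads a value written by a different iteration, the loop may be vectorized outright. A hedged sketch using an OpenMP SIMD hint; the wrapper function is added here for self-containment, and it assumes a compiler that honors #pragma omp simd.

    void add_scalar(double x[1000], double s)
    {
        #pragma omp simd                      /* assert: iterations are independent */
        for (int i = 999; i >= 0; i = i - 1)
            x[i] = x[i] + s;
    }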

Example 2 for Loop‐Level Parallelism

• Example 2:

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }

• S1 and S2 each use a value computed by the same statement in a previous iteration
  – Loop-carried dependence
• S2 uses a value computed by S1 in the same iteration
  – No loop-carried dependence

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

Example 3 for Loop‐Level Parallelism (1)

• Example 3:

    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

Example 3 for Loop‐Level Parallelism (2)

• Transform to:

    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];      /* S2 */
        A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

More on Loop‐Level Parallelism

    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis (i.e., loop-carried dependence analysis plus data dependence analysis) can be applied within the same basic block to optimize, as in the sketch below
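A sketch of the optimized form, assuming A, B, C, D, and E do not alias: the common value stays in a register, so A[i] is never reloaded from memory. The function name is chosen here.

    void fuse(double *A, const double *B, const double *C,
              double *D, const double *E)
    {
        for (int i = 0; i < 100; i = i + 1) {
            double t = B[i] + C[i];   /* computed once, kept in a register */
            A[i] = t;                 /* the store to A[i] still happens */
            D[i] = t * E[i];          /* reuse of t: no load of A[i] */
        }
    }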

Recurrence

• Recurrence is a special form of loop-carried dependence.
• Recurrence example:

    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

Finding Dependencies: Example
• Assume:
  – A load from an array element with index c × i + d, and a store to the element with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)

Example

    for (i=0; i<100; i=i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible
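The test is mechanical enough to show in a few lines of C. This is a sketch; the function names are chosen here, since the lecture only states the rule.

    #include <stdio.h>

    /* Greatest common divisor via Euclid's algorithm. */
    static int gcd(int a, int b)
    {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a < 0 ? -a : a;
    }

    /* GCD test for a store to X[a*i + b] and a load from X[c*i + d]:
       a loop-carried dependence is possible only if GCD(c, a) divides d - b. */
    static int gcd_test_may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void)
    {
        /* The slide's loop: X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("dependence possible? %s\n",
               gcd_test_may_depend(2, 3, 2, 0) ? "yes" : "no");  /* prints "no" */
        return 0;
    }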

Remarks

• The GCD test is sufficient but not necessary
  – If the test fails, no dependence is possible; if it succeeds, a dependence may still not exist
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

Finding Dependencies: Example 2
• Example 2:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }

• True dependences: S1 -> S3 and S1 -> S4 (both on Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }

• After:

    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;    /* Y renamed to T: removes output dependence S1 -> S4 */
        X1[i] = X[i] + c;   /* X renamed to X1: removes antidependence S1 -> S2 */
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

Eliminating Recurrence Dependence

• Recurrence example: a dot product

    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence.
• Transform to:

    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];

    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];

• The first loop is called scalar expansion (scalar ===> vector); it is parallel.
• The second loop is called a reduction (it sums up the elements of the vector); it is not parallel.

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):

    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];

  – Assume p ranges from 0 to 9
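A serial C sketch of the same two-level scheme (function and variable names chosen here); on a real machine, each iteration of the outer p loop would run on its own processor, and only the short final combine is serial.

    /* Sum 10,000 partial products: ten 1000-element chunks, then combine. */
    double reduce_two_level(const double sum[10000])
    {
        double finalsum[10], total = 0.0;
        for (int p = 0; p < 10; p++) {           /* each p could run on its own processor */
            finalsum[p] = 0.0;
            for (int i = 999; i >= 0; i = i - 1)
                finalsum[p] = finalsum[p] + sum[i + 1000*p];
        }
        for (int p = 0; p < 10; p++)             /* combine the 10 partial sums (serial) */
            total = total + finalsum[p];
        return total;
    }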

Multithreading and Vector Summary

• Explicitly parallel (DLP or TLP) is the next step up in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

Page 45: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Examplebull 8 memory banks Bank busy time 6 cycles Total memory

latency 12 cycles for initializedbull What is the difference between a 64‐element vector load

with a stride of 1 and 32

bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76

cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride

ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles

ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element

CA-Lec8 cwliutwinseenctuedutw 45

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 46: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Handling Sparse Matrices in Vector architecture

bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

ndash where K and M and index vectors to designate the nonzero elements of A and C

bull Sparse matrix elements stored in a compact form and accessed indirectly

bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered

CA-Lec8 cwliutwinseenctuedutw 46

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 47: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Scatter‐Gatherbull Consider

for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]

bull Ra Rc Rk and Rm contain the starting addresses of the vectors

bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]

bull A and C must have the same number of non‐zero elements (sizes of k and m)

CA-Lec8 cwliutwinseenctuedutw 47

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Code
• Example: DAXPY

        LD      F0,a        ;load scalar a
        MOV     F1,F0       ;copy a into F1 for SIMD MUL
        MOV     F2,F0       ;copy a into F2 for SIMD MUL
        MOV     F3,F0       ;copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512  ;last address to load
Loop:   L.4D    F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
        L.4D    F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
        S.4D    0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32   ;increment index to X
        DADDIU  Ry,Ry,#32   ;increment index to Y
        DSUBU   R20,R4,Rx   ;compute bound
        BNEZ    R20,Loop    ;check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples
• Attainable GFLOPs/sec
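The bound the original figure plotted can be written out; this is the standard roofline formula (the specific machine curves on the slide are not recoverable here):

\[
\text{Attainable GFLOPs/sec} = \min\left(\text{Peak Memory BW} \times \text{Arithmetic Intensity},\ \text{Peak Floating-Point Performance}\right)
\]

For instance, on a hypothetical machine with 16 GB/sec memory bandwidth and 16 GFLOPs/sec peak, the ridge point sits at an arithmetic intensity of 1 FLOP/byte: kernels below it are memory-bound, kernels above it are compute-bound.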

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"

CA-Lec8 cwliutwinseenctuedutw 61


Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism—how to craft efficient parallel algorithms—rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63


Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

CA-Lec8 cwliutwinseenctuedutw 64


Example
• Multiply two vectors of length 8192 (see the CUDA sketch below)
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65
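A minimal CUDA sketch of this organization (the kernel and array names are illustrative assumptions; the slide gives only the sizes). 16 blocks of 512 threads give exactly 8192 CUDA threads, one per element pair, so no bounds check is needed here:

    __global__ void vecmul(const double *A, const double *B, double *C)
    {
        /* one thread per element: global index from block and thread IDs */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] * B[i];
    }

    /* launch: grid of 16 blocks, 512 threads per block (8192 = 16 x 512) */
    vecmul<<<16, 512>>>(A, B, C);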


Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU
• Global Memory: 1280MB; L2 Cache: 640KB
• 15 SMs (SM 0 ... SM 14), 32 cores each; up to 1536 threads/SM
• Per SM: Registers: 32768; Shared Memory: 48KB; L1 Cache: 16KB; Texture Cache: 8KB; Constant Cache: 8KB

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the tiled kernel sketch below).

CA-Lec8 cwliutwinseenctuedutw 72
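A hedged sketch of the buffering scheme the last bullet describes (tile size, kernel, and variable names are illustrative assumptions, not from the slide): each thread computes one element of C, and TILE x TILE sub-matrices of A and B are staged in shared memory before the inner products are accumulated.

    #define TILE 16   /* assumed tile size; 4096 is divisible by 16 */

    __global__ void matmul(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < N / TILE; t++) {
            /* each thread loads one element of each tile from global memory */
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();               /* wait until both tiles are loaded */
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               /* wait before overwriting the tiles */
        }
        C[row * N + col] = sum;            /* one element of C per thread */
    }

    /* launch (assumed): dim3 grid(N/TILE, N/TILE), block(TILE, TILE);
       matmul<<<grid, block>>>(A, B, C, 4096); */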

Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU Device; variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);
    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
    // DAXPY in CUDA
    __global__
    void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

            ld.global.f64   RD0, [X+R8]     ; RD0 = X[i]
            setp.neq.s32    P1, RD0, #0     ; P1 is predicate register 1
      @!P1, bra ELSE1, *Push                ; Push old mask, set new mask bits
                                            ; if P1 false, go to ELSE1
            ld.global.f64   RD2, [Y+R8]     ; RD2 = Y[i]
            sub.f64         RD0, RD0, RD2   ; Difference in RD0
            st.global.f64   [X+R8], RD0     ; X[i] = RD0
      @P1,  bra ENDIF1, *Comp               ; complement mask bits
                                            ; if P1 true, go to ENDIF1
    ELSE1:  ld.global.f64   RD0, [Z+R8]     ; RD0 = Z[i]
            st.global.f64   [X+R8], RD0     ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop        ; pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures
• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;
• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop-Level Parallelism
• Example 2:
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2, and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop-Level Parallelism (1)
• Example 3:
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop-Level Parallelism (2)
• Transform to:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];      /* S2 */
        A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop-Level Parallelism

    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
    for (i=1; i<100; i=i+1) {
        Y[i] = Y[i-1] + Y[i];
    }
• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies: Example
• Assume:
  – Load from an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)

CA-Lec8 cwliutwinseenctuedutw 89

Example

    for (i=0; i<100; i=i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d − b| = 3
  3. Since 2 does not divide 3, no dependence is possible.

CA-Lec8 cwliutwinseenctuedutw 90

Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists (see the example below)
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91
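A hypothetical illustration (not from the slides) of the second point: here a=1, b=0 for the store X[i] and c=1, d=10 for the load X[i+10], so GCD(c, a) = 1 divides d − b = 10 and the test cannot rule out a dependence; yet a×j + b = c×k + d forces j = k + 10, which no pair of indices in [0, 9] satisfies, so no dependence actually exists.

    for (i = 0; i < 10; i = i + 1)
        X[i] = X[i + 10] + 1;   /* GCD test "succeeds", but the loop bounds prevent any overlap */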

Finding dependencies
• Example 2:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;  /* S1 */
        X[i] = X[i] + c;  /* S2 */
        Z[i] = Y[i] + c;  /* S3 */
        Y[i] = c - Y[i];  /* S4 */
    }
• True dependences: S1 -> S3 (Y[i]), S1 -> S4 (Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence
• Recurrence example: a dot product
    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];
    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];
• The first loop is called scalar expansion (scalar ===> vector); it is parallel.
• The second loop is called a reduction; it sums up the elements of the vector and is not parallel.

CA-Lec8 cwliutwinseenctuedutw 94

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 48: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Vector Architecture Summarybull Vector is alternative model for exploiting ILP

ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order

bull More lanes slower clock rate

ndash Scalable if elements are independentndash If there is dependency

bull One stall per vector instruction rather than one stall per vector element

bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of

vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth

ndash Especially with virtual address translation and caching

CA-Lec8 cwliutwinseenctuedutw 48

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 49: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks

CA-Lec8 cwliutwinseenctuedutw 49

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies: Example
• Assume:
 – The loop stores to an array element with index a × i + b and loads from the element with index c × i + d
 – i runs from m to n
• A dependence exists if the following two conditions hold:
 1. There are two iteration indices j and k, both within the loop bounds: m ≤ j ≤ n and m ≤ k ≤ n
 2. The store and the load touch the same element: a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
 – Dependence testing is expensive but decidable
 – GCD (greatest common divisor) test:
• If a loop‐carried dependence exists, then GCD(c, a) must divide (d − b), written GCD(c, a) | |d − b|


Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution (the test is coded below):
 1. a=2, b=3, c=2, and d=0
 2. GCD(a, c) = 2, |b − d| = 3
 3. Since 2 does not divide 3, no dependence is possible
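A minimal C sketch of the GCD test (function names are illustrative, not from the slides; it assumes a and c are not both zero):

#include <stdio.h>

/* Greatest common divisor by Euclid's algorithm */
static int gcd(int x, int y) {
    while (y != 0) { int t = y; y = x % y; x = t; }
    return x;
}

/* Store index a*i+b, load index c*i+d: a loop-carried dependence
   is possible only if GCD(c, a) divides (d - b). */
static int may_depend(int a, int b, int c, int d) {
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % gcd(c, a) == 0;  /* 1: dependence possible, 0: impossible */
}

int main(void) {
    /* Values from the example above: a=2, b=3, c=2, d=0 */
    printf("%s\n", may_depend(2, 3, 2, 0) ? "dependence possible"
                                          : "no dependence possible");
    return 0;
}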


Remarks

• The GCD test is sufficient but not necessary
 – The GCD test does not consider the loop bounds
 – There are cases where the GCD test succeeds (a dependence looks possible) but no dependence actually exists
• In general, determining whether a dependence actually exists is NP‐complete


Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}

• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop‐carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])


Renaming to Eliminate False (Pseudo) Dependences

• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}

• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;   /* Y renamed to T: removes the output dependence */
    X1[i] = X[i] + c;  /* X renamed to X1: removes the antidependence */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
 – Later uses of X must then refer to X1


Eliminating Recurrence Dependence

• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop‐carried dependence on sum
• Transform to…
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];


The first loop is called scalar expansion: the scalar sum becomes a vector, and the loop is parallel.
The second loop is called a reduction: it sums up the elements of the vector and is not parallel.

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
 – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
 – Assume p ranges from 0 to 9
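A minimal C sketch of this two‐phase scheme, assuming the 10000 partial products sum[0..9999] were already produced by scalar expansion (NPROC and CHUNK are illustrative names):

#include <stdio.h>
#define N     10000
#define NPROC 10
#define CHUNK (N / NPROC)   /* 1000 elements per processor */

double sum[N];              /* partial products from scalar expansion */
double finalsum[NPROC];

int main(void) {
    for (int i = 0; i < N; i++) sum[i] = 1.0;  /* stand-in data */

    /* Phase 1: processor p sums its own 1000-element chunk;
       the p iterations are independent, so they can run in parallel */
    for (int p = 0; p < NPROC; p++) {
        finalsum[p] = 0.0;
        for (int i = CHUNK - 1; i >= 0; i--)
            finalsum[p] = finalsum[p] + sum[i + CHUNK * p];
    }

    /* Phase 2: a short serial reduction over the ten per-processor sums */
    double total = 0.0;
    for (int p = 0; p < NPROC; p++)
        total = total + finalsum[p];

    printf("total = %f\n", total);
    return 0;
}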


Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to higher performance
• Coarse‐grained vs. fine‐grained multithreading
 – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine‐grained multithreading based on an OOO superscalar microarchitecture
 – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting DLP
 – If code is vectorizable, the hardware is simpler, more energy efficient, and a better real‐time model than OOO machines
 – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 50: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set

ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary

bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units

busyndash loop unrolling to hide latencies increases register pressure

bull Trend towards fuller vector support in microprocessors

CA-Lec8 cwliutwinseenctuedutw 50

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 51: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)

ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits

ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands

bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler

+

CA-Lec8 cwliutwinseenctuedutw 51

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 52: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b

ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b

bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b

ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack

ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large

CA-Lec8 cwliutwinseenctuedutw 52

SIMD Implementations IA32AMD64

bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops

bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit

integerfp opsndash Single‐precision floating‐point arithmetic

bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic

bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations

CA-Lec8 cwliutwinseenctuedutw 53

SIMD Implementations IBMbull VMX (1996‐1998)

ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement

bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers

bull VMX 128 (Xbox360)ndash Extension to 128 registers

bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers

bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP

CA-Lec8 cwliutwinseenctuedutw 54

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop-Level Parallelism

• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 each use a value computed by the same statement in a previous iteration (A[i] by S1, B[i] by S2)
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – No loop-carried dependence


Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements


Example 3 for Loop-Level Parallelism (1)

• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2


Example 3 for Loop-Level Parallelism (2)

• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];         /* S2 */
    A[i+1] = A[i+1] + B[i+1];     /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried


More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
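A sketch of the resulting optimization (illustrative only): the value written to A[i] is kept in a register, so the second reference costs no load.

for (int i = 0; i < 100; i = i + 1) {
    double a = B[i] + C[i];   // compute once, keep in a register
    A[i] = a;                 // store A[i]
    D[i] = a * E[i];          // reuse the register; no reload of A[i]
}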


Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP


Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop


Finding Dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to the same array with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k within the loop bounds: m ≤ j ≤ n, m ≤ k ≤ n
  2. The store and the load touch the same element: a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)
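A small C sketch of the screen this gives a compiler (function names are illustrative; assumes a and c are not both zero):

#include <stdlib.h>

static int gcd(int a, int b)            /* Euclid's algorithm */
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Store index a*i+b, load index c*i+d: a loop-carried dependence
   is possible only if gcd(c, a) divides (d - b). */
static int may_depend(int a, int b, int c, int d)
{
    int g = gcd(abs(c), abs(a));
    return (d - b) % g == 0;
}

With the values of the next slide, may_depend(2, 3, 2, 0) finds (0 − 3) not divisible by 2 and returns false, ruling out any dependence.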


Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2, |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible


Remarks

• The GCD test is sufficient but not necessary: if the test fails, no dependence is possible; if it succeeds, a dependence may or may not exist
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete


Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1 -> S3 (Y[i]) and S1 -> S4 (Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])


Renaming to Eliminate False (Pseudo) Dependences

• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;    /* Y renamed to T */
    X1[i] = X[i] + c;   /* X renamed to X1 */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
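Here T is a compiler-generated temporary. If X is live after the loop, the compiler either substitutes X1 for the later uses of X or inserts a copy; in practice the renaming often requires no actual copy at all.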


Eliminating Recurrence Dependence

• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to…

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];            /* scalar expansion: scalar ==> vector; parallel */

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];    /* reduction: sums the elements of the vector; not parallel */

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors (p ranges from 0 to 9):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
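The ten partial sums still have to be combined in a last, serial step; a short sketch, assuming the partial results sit in finalsum[0..9]:

double total = 0.0;
for (int p = 0; p < 10; p = p + 1)
    total = total + finalsum[p];   // combine the ten partial sums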


Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth





Page 55: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Why SIMD Extensions

bull Media applications operate on data types narrower than the native word size

bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory

ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely

bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture

CA-Lec8 cwliutwinseenctuedutw 55

Example SIMD Codebull Example DXPY

LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load

Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done

CA-Lec8 cwliutwinseenctuedutw 56

Challenges of SIMD Architecturesbull Scalar processor memory architecture

ndash Only access to contiguous databull No efficient scattergather accesses

ndash Significant penalty for unaligned memory accessndash May need to write entire vector register

bull Limitations on data access patternsndash Limited by cache line up to 128‐256b

bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions

bull Register pressurendash Need to use multiple registers rather than register depth to hide

latency

CA-Lec8 cwliutwinseenctuedutw 57

Roofline Performance Modelbull Basic idea

ndash Plot peak floating‐point throughput as a function of arithmetic intensity

ndash Ties together floating‐point performance and memory performance for a target machine

bull Arithmetic intensityndash Floating‐point operations per byte read

CA-Lec8 cwliutwinseenctuedutw 58

Examples

bull Attainable GFLOPssec

CA-Lec8 cwliutwinseenctuedutw 59

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism (the analysis is "inexact") and to eliminate name dependences

• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (where i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]

• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding Dependencies: Example

• Assume:
  – A load of an array element with index c × i + d and a store to index a × i + b
  – i runs from m to n

• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d (the store in iteration j and the load in iteration k touch the same element)

• In general, the values of a, b, c, and d are not known at compile time
  – Fully general dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:

• If a loop-carried dependence exists, then GCD(c, a) must divide |d - b|

CA-Lec8 cwliutwinseenctuedutw 89
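As a sketch, the test is easy to code directly; gcd_test below is an illustrative helper (not part of the original slides) for a store to A[a*i + b] and a load from A[c*i + d], assuming a and c are nonzero. It returns 0 only when the divisibility condition proves that no loop-carried dependence is possible.

  #include <stdlib.h>

  /* Euclid's algorithm for the greatest common divisor. */
  static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

  /* GCD test: returns 1 if a dependence is possible, 0 if it is ruled out. */
  int gcd_test(int a, int b, int c, int d) {
      return abs(d - b) % gcd(abs(c), abs(a)) == 0;
  }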

Example

for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2; |b - d| = 3
  3. Since 2 does not divide 3, no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

• The GCD test is conservative: it is sufficient to rule out a dependence, but not exact
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (a dependence cannot be ruled out) but no dependence actually exists

• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91
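A concrete illustration of the second remark (an added example, not from the slides): in the loop below, a=1, b=10, c=1, d=0, and GCD(c, a) = 1 divides |d - b| = 10, so the GCD test cannot rule out a dependence. Yet the stores touch X[10]..X[19] while the loads touch X[0]..X[9], so once the loop bounds are taken into account no dependence exists.

  for (i=0; i<10; i=i+1)
      X[i+10] = X[i] + 1;   /* store indices 10..19, load indices 0..9: disjoint */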

Finding Dependencies

• Example 2:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
  }

• True dependences: S1→S3 and S1→S4, both through Y[i]; neither is loop-carried

• Antidependences: S1→S2 on X[i], and S3→S4 on Y[i]
• Output dependence: S1→S4 on Y[i]

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences

• Before:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
  }

• After:
  for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;    /* Y renamed to T: removes the anti- and output dependences */
    X1[i] = X[i] + c;   /* X renamed to X1 */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
  }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example (a dot product):
  for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to…
  for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

  for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

CA-Lec8 cwliutwinseenctuedutw 94

The first loop is called scalar expansion: the scalar sum becomes a vector, and this loop is fully parallel.

The second loop is called a reduction: it sums up the elements of the vector and is not parallel as written.

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment

• Example: to sum 1000 elements on each of ten processors:
  for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9, so each processor accumulates its own partial sum in parallel

CA-Lec8 cwliutwinseenctuedutw 95
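The ten partial sums still have to be combined at the end; a short sketch of that final step (added illustration):

  /* Phase 2: combine the ten per-processor partial sums.
     Serial, but only 10 iterations instead of 10,000. */
  double total = 0.0;
  for (int p = 0; p < 10; p = p + 1)
      total = total + finalsum[p];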

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to higher performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, it reuses the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, vector hardware is simpler, more energy efficient, and offers a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96


shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 60: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

History of GPUs

bull Early video cardsndash Frame buffer memory with address generation for video output

bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles

bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing

CA-Lec8 cwliutwinseenctuedutw 60

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 61: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Graphical Processing Units

bull Basic ideandash Heterogeneous execution model

bull CPU is the host GPU is the device

ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo

CA-Lec8 cwliutwinseenctuedutw 61

Graphical P

rocessing Units

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 62: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Programming Model

bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++

bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language

ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores

CA-Lec8 cwliutwinseenctuedutw 62

NVIDIA GPU Architecture

bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files

bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63

Graphical P

rocessing Units

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

NVIDIA GPU Architecture

• Similarities to vector machines
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files

• Differences
  – No scalar processor (scalar code runs on the host)
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor

CA-Lec8 cwliutwinseenctuedutw 63


Programming the GPU

• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid

• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?

CA-Lec8 cwliutwinseenctuedutw 64


Example

• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors

(a host-side sizing sketch follows)

CA-Lec8 cwliutwinseenctuedutw 65
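The slide's arithmetic (8192 elements / 512 threads per block = 16 blocks) can be written directly in host code. A minimal sketch; the kernel name vecMul and the element-wise multiply are illustrative assumptions, not lecture code:

    /* One thread per element; the grid covers all 8192 elements. */
    __global__ void vecMul(int n, const double *x, const double *y, double *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
        if (i < n) out[i] = x[i] * y[i];                /* one element per thread */
    }

    /* Host side: grid size = 8192 / 512 = 16 thread blocks. */
    void launch(const double *x, const double *y, double *out) {
        int n = 8192, threads = 512;
        int blocks = (n + threads - 1) / threads;       /* = 16 */
        vecMul<<<blocks, threads>>>(n, x, y, out);
    }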


Terminology

• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66


Example

• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67


GTX570 GPU (memory hierarchy figure)

• Global Memory: 1280 MB; L2 Cache: 640 KB
• SM 0 … SM 14 (15 SMs, 32 cores each; up to 1536 threads/SM), each with:
  – Registers: 32,768; Shared Memory: 48 KB; L1 Cache: 16 KB
  – Texture Cache: 8 KB; Constant Cache: 8 KB

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

• 32 threads within a block work collectively
  – Memory access optimization, latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplication

• For a 4096×4096 matrix multiplication
  – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a kernel sketch follows)

CA-Lec8 cwliutwinseenctuedutw 72
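The shared-memory buffering described above is what a tiled CUDA kernel does. The following is a minimal sketch under stated assumptions (kernel name matMul, TILE = 16, row-major n×n float matrices, n divisible by TILE), not the lecture's actual code:

    #define TILE 16

    __global__ void matMul(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE];   /* sub-matrix of A buffered on chip */
        __shared__ float Bs[TILE][TILE];   /* sub-matrix of B buffered on chip */

        int row = blockIdx.y * TILE + threadIdx.y;  /* this thread's C row    */
        int col = blockIdx.x * TILE + threadIdx.x;  /* this thread's C column */
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; t++) {        /* walk tiles along k */
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                        /* tile fully loaded */
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                        /* done with this tile */
        }
        C[row * n + col] = acc;                     /* one element of C per thread */
    }

    /* Launch for n = 4096:
       dim3 block(TILE, TILE);
       dim3 grid(n / TILE, n / TILE);
       matMul<<<grid, block>>>(dA, dB, dC, n);      */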

Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU Device; variables declared in them are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within its block
  – blockDim: threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

(a self-contained host driver sketch follows)

CA-Lec8 cwliutwinseenctuedutw 74
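The slide code omits allocation and transfer. A minimal host driver sketch, an assumption for illustration rather than lecture code; note that in practice a kernel launched with <<<...>>> must be declared __global__:

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main(void) {
        int n = 8192;
        size_t bytes = n * sizeof(double);
        double *x = (double *)malloc(bytes), *y = (double *)malloc(bytes);
        for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

        double *dx, *dy;                          /* device copies */
        cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
        cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

        int nblocks = (n + 255) / 256;            /* 256 threads per block */
        daxpy<<<nblocks, 256>>>(n, 2.0, dx, dy);
        cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);

        printf("y[0] = %f\n", y[0]);              /* expect 2.0*1.0 + 2.0 = 4.0 */
        cudaFree(dx); cudaFree(dy); free(x); free(y);
        return 0;
    }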

NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • I.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer

CA-Lec8 cwliutwinseenctuedutw 76

Example

if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0        ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64        RD0, RD0, RD2      ; Difference in RD0
st.global.f64  [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw 77

NVIDIA GPU Memory Structures

• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw 78

Fermi Architecture Innovations

• Each SIMD processor has
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw 79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw 80

Compiler Technology for Loop‐Level Parallelism

• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

• No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 81

Example 2 for Loop‐Level Parallelism

• Example 2:

  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }

• S1 and S2 use values computed by S1 in the previous iteration
  – Loop-carried dependence
• S2 uses a value computed by S1 in the same iteration
  – No loop-carried dependence

CA-Lec8 cwliutwinseenctuedutw 82

Remarks

• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

• Example 3:

  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
  }

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

• Transform to:

  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];        /* S2 */
      A[i+1] = A[i+1] + B[i+1];    /* S1 */
  }
  B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize (a register-reuse sketch follows)

CA-Lec8 cwliutwinseenctuedutw 86
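What the optimization amounts to, as a small C sketch (illustrative, not from the slides): after the analysis, the compiler may keep the computed value in a temporary/register instead of re-loading A[i]:

    void example(double *A, double *B, double *C, double *D, double *E) {
        for (int i = 0; i < 100; i = i + 1) {
            double t = B[i] + C[i];  /* the value of A[i], held in a register */
            A[i] = t;                /* store once */
            D[i] = t * E[i];         /* reuse t: no load of A[i] needed */
        }
    }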

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:

  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding Dependencies: Example

• Assume:
  – Load an array element with index c*i + d, and store to a*i + b
  – i runs from m to n
• Dependence exists if the following two conditions hold:
  1. Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c) = 2, |d-b| = 3
  3. Since 2 does not divide 3, no dependence is possible

(a small checker in C follows)

CA-Lec8 cwliutwinseenctuedutw 90
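As a concrete check, a small C sketch of the test as just stated; the function names are illustrative, not from the lecture:

    #include <stdio.h>

    /* Greatest common divisor (Euclid). */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a < 0 ? -a : a;
    }

    /* GCD test for store X[a*i+b] vs. load X[c*i+d]:
       returns 0 only when a dependence is impossible. */
    static int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;   /* 1 = dependence possible */
    }

    int main(void) {
        /* Slide's example: X[2*i+3] = X[2*i] * 5.0 => a=2, b=3, c=2, d=0 */
        printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));  /* prints 0 */
        return 0;
    }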

Remarks

• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding Dependencies: Example 2

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;    /* S1 */
    X[i] = X[i] + c;    /* S2 */
    Z[i] = Y[i] + c;    /* S3 */
    Y[i] = c - Y[i];    /* S4 */
}

• True dependences: S1 -> S3 (Y[i]), S1 -> S4 (Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw 92

Renaming to Eliminate False (Pseudo) Dependences

• Before:

  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }

• After:

  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;     /* T renames Y: breaks the false dependences */
      X1[i] = X[i] + c;    /* X1 renames X */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example, a dot product:

  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to…

  /* This is called scalar expansion: scalar ===> vector; parallel */
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];

  /* This is called a reduction: sums up the elements of the vector; not parallel */
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];

CA-Lec8 cwliutwinseenctuedutw 94

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (a sketch follows)

  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i + 1000*p];

  – Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95
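Putting the two phases together, a plain-C sketch of the scheme under illustrative assumptions (10,000 products already in sum[], one value of p per processor):

    /* Phase 1 can run on ten processors in parallel (one p each);
       phase 2 is a short serial combine of the ten partial sums. */
    double sum[10000];            /* element-wise products, e.g. x[i]*y[i] */
    double finalsum[10];          /* one partial sum per processor p */

    double reduce(void) {
        for (int p = 0; p < 10; p++)          /* each p independent: parallel */
            for (int i = 999; i >= 0; i = i-1)
                finalsum[p] = finalsum[p] + sum[i + 1000*p];

        double total = 0;                     /* combine 10 values serially */
        for (int p = 0; p < 10; p++)
            total = total + finalsum[p];
        return total;
    }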

Multithreading and Vector Summary

• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 64: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Programming the GPUbull CUDA Programming Model

ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid

bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications

CA-Lec8 cwliutwinseenctuedutw 64

Graphical P

rocessing Units

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 65: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example

bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes

bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32

ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler

ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors

CA-Lec8 cwliutwinseenctuedutw 65

Graphical P

rocessing Units

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 66: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Terminology

bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions

bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors

bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors

CA-Lec8 cwliutwinseenctuedutw 66

Graphical P

rocessing Units

Example

bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to

bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements

ndash Fermi has 16 physical SIMD lanes each containing 2048 registers

CA-Lec8 cwliutwinseenctuedutw 67

Graphical P

rocessing Units

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 68: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

GTX570 GPUGlobal Memory

1280MB

Texture Cache8KB

Constant Cache8KB

Shared Memory48KB

L1 Cache16KB

L2 Cache640KB

SM 0

Registers32768

Shared Memory48KB

SM 14

Registers32768

32 cores

32cores

Up to 1536 ThreadsSM

CA-Lec8 cwliutwinseenctuedutw 68

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 69: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

GPU Threads in SM (GTX570)

bull 32 threads within a block work collectively Memory access optimization latency hiding

CA-Lec8 cwliutwinseenctuedutw 69

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 70: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Matrix Multiplication

CA-Lec8 cwliutwinseenctuedutw 70

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 71: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Programming the GPU

CA-Lec8 cwliutwinseenctuedutw 71

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop‐carried dependence
• Transform to…

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

This is called scalar expansion (scalar ⇒ vector); this loop is parallel

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

This is called a reduction; it sums up the elements of the vector and is not parallel

CA-Lec8 cwliutwinseenctuedutw 94

Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];

– Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95
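A self-contained C sketch of the full two-level reduction, under the assumption (not stated on the slide) that the ten partial sums are finally combined in a short serial pass; the function name is illustrative:

/* Two-level reduction: ten partial sums over 1000-element slices,
   then a serial combine of the ten partials (assumed final step). */
double reduce_10000(const double sum[10000]) {
    double finalsum[10] = {0.0};
    for (int p = 0; p < 10; p++)          /* each p could run on its own processor */
        for (int i = 999; i >= 0; i--)
            finalsum[p] = finalsum[p] + sum[i + 1000*p];
    double total = 0.0;
    for (int p = 0; p < 10; p++)          /* final combine, itself a small reduction */
        total = total + finalsum[p];
    return total;
}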

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse‐grained vs. fine‐grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine‐grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then it is simpler hardware, more energy efficient, and a better real‐time model than OOO machines
– Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 72: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Matrix Multiplicationbull For a 4096x4096matrix multiplication

‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means

we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell

at a timebull Each thread exploits the GPUs fine granularity by computing

one element of Matrix Cbull Sub‐matrices are read into shared memory from global

memory to act as a buffer and take advantage of GPU bandwidth

CA-Lec8 cwliutwinseenctuedutw 72

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 73: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Programming the GPU

bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory

_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block

CA-Lec8 cwliutwinseenctuedutw 73

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 74: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)

DAXPY in Cvoid daxpy(int n double a double x double y)

for (int i=0iltni++)y[i]= ax[i]+ y[i]

Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)

DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)

int i=blockIDxxblockDimx+threadIdxxif (iltn)

y[i]= ax[i]+ y[i]

CA-Lec8 cwliutwinseenctuedutw 74

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 75: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example

shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])

CA-Lec8 cwliutwinseenctuedutw 75

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 76: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Conditional Branching

bull Like vector architectures GPU branch hardware uses internal masksbull Also uses

ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)

ndash Instruction markers to manage when a branch diverges into multiple execution paths

bull Push on divergent branch

ndash hellipand when paths convergebull Act as barriersbull Pops stack

bull Per‐thread‐lane 1‐bit predicate register specified by programmer

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

76

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 77: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Exampleif (X[i] = 0)

X[i] = X[i] ndash Y[i]else X[i] = Z[i]

ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits

if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits

if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]

stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

77

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop-Level Parallelism (2)

• Transform to:
    A[0] = A[0] + B[0];
    for (i = 0; i < 99; i = i + 1) {
        B[i+1] = C[i] + D[i];          /* S2 */
        A[i+1] = A[i+1] + B[i+1];      /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried

More on Loop-Level Parallelism

    for (i = 0; i < 100; i = i + 1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction (see the sketch below)
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
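A minimal C sketch of the optimization this enables (the function and temporary names are illustrative): the value stored to A[i] stays in a register, so the operand of D[i] is never reloaded from memory.

    void fuse(double A[100], double B[100], double C[100],
              double D[100], double E[100])
    {
        for (int i = 0; i < 100; i = i + 1) {
            double t = B[i] + C[i];    /* value of A[i] kept in a register */
            A[i] = t;                  /* single store to A[i] */
            D[i] = t * E[i];           /* reuse t: no load of A[i] */
        }
    }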

Recurrence

• Recurrence is a special form of loop-carried dependence
• Recurrence example:
    for (i = 1; i < 100; i = i + 1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP

Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ] (illustrated below)
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
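A small C illustration of the distinction (the arrays and the contents of y[] are assumed; y[i] must stay within B's bounds):

    void affine_demo(double A[300], double B[300], const int y[100])
    {
        for (int i = 0; i < 100; i = i + 1) {
            A[2*i + 3] = 0.0;      /* affine: a*i + b with a = 2, b = 3 */
            B[y[i]]    = 0.0;      /* non-affine: index is data-dependent */
        }
    }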

Finding Dependences: Example

• Assume:
  – Load an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide (d − b), i.e., GCD(c, a) | |d − b|

Example

    for (i = 0; i < 100; i = i + 1)
        X[2*i + 3] = X[2*i] * 5.0;

• Solution (a C sketch of this test follows):
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2, |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible
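A small, self-contained C sketch of the GCD test as stated above; gcd and gcd_test are illustrative helper names, not part of any compiler API. Returning 0 proves independence; returning 1 means only "dependence possible", since the test is sufficient but not necessary.

    #include <stdio.h>
    #include <stdlib.h>

    /* Euclid's algorithm. */
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    /* Dependence between the store to a*i + b and the load from c*i + d
     * is possible only if gcd(c, a) divides (d - b). */
    static int gcd_test(int a, int b, int c, int d)
    {
        return abs(d - b) % gcd(abs(c), abs(a)) == 0;
    }

    int main(void)
    {
        /* X[2*i + 3] = X[2*i] * 5.0:  a = 2, b = 3, c = 2, d = 0 */
        printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));  /* prints 0 */
        return 0;
    }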

Remarks

• The GCD test is sufficient but not necessary (see the example after this list)
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
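As a concrete illustration (a hypothetical loop, not from the lecture): here GCD(c, a) = 1 divides d − b = 100, so the test cannot rule out a dependence, yet the loop bounds make one impossible.

    /* Store index: a = 1, b = 0.  Load index: c = 1, d = 100.
     * gcd(1, 1) = 1 divides 100, so the GCD test reports a possible
     * dependence; but a*j + b = c*k + d requires j = k + 100, which no
     * j, k in [0, 99] can satisfy -- the bounds preclude the dependence. */
    void gcd_false_alarm(double X[200])
    {
        for (int i = 0; i < 100; i = i + 1)
            X[i] = X[i + 100] * 2.0;
    }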

Finding Dependences: Example 2

    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }

• True dependences: S1 -> S3 and S1 -> S4 (Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])

Renaming to Eliminate False (Pseudo) Dependences

• Before:
    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i = 0; i < 100; i = i + 1) {
        T[i]  = X[i] / c;     /* Y renamed to T */
        X1[i] = X[i] + c;     /* X renamed to X1 */
        Z[i]  = T[i] + c;
        Y[i]  = c - T[i];
    }

Eliminating Recurrence Dependence

• Recurrence example, a dot product:
    for (i = 9999; i >= 0; i = i - 1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
    for (i = 9999; i >= 0; i = i - 1)    /* scalar expansion:            */
        sum[i] = x[i] * y[i];            /* scalar ==> vector, parallel  */

    for (i = 9999; i >= 0; i = i - 1)    /* reduction: sums up the       */
        finalsum = finalsum + sum[i];    /* vector elements, not parallel */

Reduction

• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (a full sketch follows below):
    for (i = 999; i >= 0; i = i - 1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];
  – Assume p ranges from 0 to 9
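A minimal, self-contained C sketch of this two-level reduction (function and array names are illustrative): each of the ten per-processor loops is independent and can run in parallel; only the final ten-element combine is serial.

    #define N      10000
    #define PROCS  10

    double reduce(const double sum[N])
    {
        double finalsum[PROCS] = {0.0};

        for (int p = 0; p < PROCS; p++)          /* each p can run in parallel */
            for (int i = 999; i >= 0; i = i - 1)
                finalsum[p] = finalsum[p] + sum[i + 1000*p];

        double total = 0.0;                      /* short serial combine step */
        for (int p = 0; p < PROCS; p++)
            total = total + finalsum[p];
        return total;
    }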

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth

Page 78: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

NVIDIA GPU Memory Structures

bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables

bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block

bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

78

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 79: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Fermi Architecture Innovationsbull Each SIMD processor has

ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4

special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock

cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

79

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 80: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Fermi Multithreaded SIMD Proc

CA-Lec8 cwliutwinseenctuedutw

Graphical P

rocessing Units

80

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 81: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Compiler Technology for Loop‐Level Parallelism

bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations

bull Loop‐level parallelism has no loop‐carried dependence

bull Example 1for (i=999 igt=0 i=i‐1)

x[i] = x[i] + s

bull No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

81

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 82: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example 2 for Loop‐Level Parallelism

bull Example 2for (i=0 ilt100 i=i+1)

A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2

bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence

bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

82

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 83: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Remarks

bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence

bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1

bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements

CA-Lec8 cwliutwinseenctuedutw 83

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 84: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example 3 for Loop‐Level Parallelism (1)

bull Example 3for (i=0 ilt100 i=i+1)

A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2

bull S1 uses value computed by S2 in previous iteration but this dependence is not circular

bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2

CA-Lec8 cwliutwinseenctuedutw 84

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies Examplebull Assume

ndash Load an array element with index c i + d and store to a i+ b

ndash i runs from m to nbull Dependence exists if the following two conditions hold

1 Given j k such that m le j le n m le k le n2 a j + b = c k + d

bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test

bull If a loop‐carried dependence exists then GCD(ca) | |db|

CA-Lec8 cwliutwinseenctuedutw 89

Example

for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50

bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible

CA-Lec8 cwliutwinseenctuedutw 90

Remarks

bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists

bull Determining whether a dependence actually exists is NP‐complete

CA-Lec8 cwliutwinseenctuedutw 91

Finding dependenciesbull Example 2

for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4

bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried

bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])

CA-Lec8 cwliutwinseenctuedutw

Detecting and E

nhancing Loop-Level Parallelism

92

Renaming to Eliminate False (Pseudo) Dependences

bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]

bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]

CA-Lec8 cwliutwinseenctuedutw 93

Eliminating Recurrence Dependence

bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)

sum = sum + x[i] y[i]

bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip

for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]

for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]

CA-Lec8 cwliutwinseenctuedutw 94

This is called scalar expansionScalar ===gt VectorParallel

This is called a reductionSums up the elements of the vectorNot parallel

Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector

and SIMD architecturendash Similar to what can be done in multiprocessor environment

bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)

finalsum[p] = finalsum[p] + sum[i+1000p]

ndash Assume p ranges from 0 to 9

CA-Lec8 cwliutwinseenctuedutw 95

Multithreading and Vector Summary

bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading

ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading

based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers

bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy

efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number

of vector registers length of vector registers exception handling conditional operations and so on

bull Fundamental design issue is memory bandwidth

CA-Lec8 cwliutwinseenctuedutw 96

Page 85: Computer Architecture Lecture 8: Vector …twins.ee.nctu.edu.tw/courses/ca_13/lecture/CA_lec08...Computer Architecture Lecture 8: Vector Processing (Chapter 4) Chih‐Wei Liu 劉志尉

Example 3 for Loop‐Level Parallelism (2)

bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)

B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1

B[100] = C[99] + D[99]

bull The dependence between the two statements is no longer loop carried

CA-Lec8 cwliutwinseenctuedutw 85

More on Loop‐Level Parallelism

for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]

bull The second reference to A needs not be translated to a load

instructionndash The two reference are the same There is no intervening memory

access to the same location

bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize

CA-Lec8 cwliutwinseenctuedutw 86

Recurrence

bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example

for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]

bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence

ndash It my still be possible to exploit a fair amount of ILP

CA-Lec8 cwliutwinseenctuedutw 87

Compiler Technology for Finding Dependences

bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences

bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i

is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension

is affinendash Non‐affine access example x[ y[i] ]

bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop

CA-Lec8 cwliutwinseenctuedutw 88

Finding dependencies: Example
• Assume:
  – a load from an array element with index c * i + d and a store to index a * i + b,
  – i runs from m to n.
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n.
  2. a * j + b = c * k + d.
• In general, the values of a, b, c, and d are not known at compile time:
  – Dependence testing is expensive but decidable.
  – GCD (greatest common divisor) test: if a loop‐carried dependence exists, then GCD(c, a) must divide (d − b). (A sketch of the test follows.)
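A minimal C sketch of the GCD test; the function names are made up for the example, and the test is deliberately conservative:

    #include <stdlib.h>   /* abs */

    /* Euclid's algorithm. */
    static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return abs(a);
    }

    /* GCD test for a store to a*i + b and a load from c*i + d.
       Returns 0 only when no loop-carried dependence can exist;
       returning 1 means a dependence MAY exist, because the test
       ignores the loop bounds m..n. */
    static int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(c, a);
        if (g == 0)                /* a == 0 and c == 0: constant indices */
            return (d - b) == 0;
        return (d - b) % g == 0;
    }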

Example

    for (i=0; i<100; i=i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }

• Solution (checked with the sketch below):
  1. a = 2, b = 3, c = 2, and d = 0.
  2. GCD(a, c) = 2, |d − b| = 3.
  3. Since 2 does not divide 3, no dependence is possible.
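Running the hypothetical gcd_test_may_depend sketch from the previous page on these coefficients:

    #include <stdio.h>

    int main(void) {
        /* X[2*i+3] = X[2*i] * 5.0  =>  a = 2, b = 3, c = 2, d = 0 */
        printf("dependence possible? %s\n",
               gcd_test_may_depend(2, 3, 2, 0) ? "yes" : "no");  /* prints "no" */
        return 0;
    }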

Remarks

• The GCD test is sufficient to rule out a dependence, but passing it does not prove one exists:
  – The GCD test does not consider the loop bounds.
  – There are cases where the GCD test succeeds but no dependence exists.
• Determining whether a dependence actually exists is NP‐complete.

Finding dependencies
• Example 2:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }

• True dependences: S1 -> S3 and S1 -> S4 (through Y[i]), but not loop‐carried.
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i]).
• Output dependence: S1 -> S4 (Y[i]).

Renaming to Eliminate False (Pseudo) Dependences

• Before:

    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }

• After:

    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

  (Code after the loop must read X1 wherever it previously read the updated X.)

Eliminating Recurrence Dependence

• Recurrence example: a dot product.

    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop‐carried dependence.
• Transform to (combined in the sketch below):

    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];

  This is called scalar expansion (scalar ===> vector): the loop is now parallel.

    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];

  This is called a reduction: it sums up the elements of the vector, and it is not parallel.
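Putting the two loops together as compilable C; the array size and the function name are made up for the example:

    #define N 10000
    double x[N], y[N], sum[N];

    double dot_product(void) {
        double finalsum = 0.0;
        int i;
        /* Scalar expansion: iterations are independent, so this
           loop is parallel/vectorizable. */
        for (i = N - 1; i >= 0; i = i - 1)
            sum[i] = x[i] * y[i];
        /* Reduction: loop-carried dependence on finalsum, so this
           loop is sequential without special reduction support. */
        for (i = N - 1; i >= 0; i = i - 1)
            finalsum = finalsum + sum[i];
        return finalsum;
    }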

Reduction

• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures:
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum up 1000 elements on each of ten processors, where p ranges from 0 to 9 (full pattern sketched below):

    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];
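A minimal sequential C sketch of the whole pattern, assuming the 10,000 partial products already live in sum[]; in a real system each p‐iteration of step 1 would run on its own processor, and the names here are illustrative:

    #define P 10              /* processors                        */
    #define N 10000           /* total elements (1000 per proc.)   */
    double sum[N];            /* partial products to be summed     */
    double finalsum[P];       /* one partial sum per processor     */

    double reduce_all(void) {
        /* Step 1: the P outer iterations touch disjoint slices of
           sum[] and finalsum[], so they can run in parallel. */
        for (int p = 0; p < P; p = p + 1)
            for (int i = 999; i >= 0; i = i - 1)
                finalsum[p] = finalsum[p] + sum[i + 1000*p];

        /* Step 2: combine the P partial sums; this small reduction
           is sequential (or another round of the same trick). */
        double total = 0.0;
        for (int p = 0; p < P; p = p + 1)
            total = total + finalsum[p];
        return total;
    }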

Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse‐grained vs. fine‐grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading: fine‐grained multithreading built on an OOO superscalar microarchitecture.
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP.
  – If code is vectorizable, it offers simpler hardware, better energy efficiency, and a better real‐time model than OOO machines.
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.
