Computer Architecture
Lecture 8: Vector Processing (Chapter 4)
Chih-Wei Liu 劉志尉
National Chiao Tung University
cwliu@twins.ee.nctu.edu.tw
Introduction
• SIMD architectures can exploit significant data-level parallelism for:
  – matrix-oriented scientific computing
  – media-oriented image and sound processors
• SIMD is more energy efficient than MIMD
  – Only needs to fetch one instruction per data operation
  – Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
SIMD Variations
• Vector architectures
• SIMD extensions
  – MMX: multimedia extensions (1996)
  – SSE: streaming SIMD extensions
  – AVX: advanced vector extensions
• Graphics Processing Units (GPUs)
  – Considered as SIMD accelerators
SIMD vs. MIMD
• For x86 processors:
  – Expect two additional cores per chip per year
  – SIMD width to double every four years
  – Potential speedup from SIMD to be twice that from MIMD
Vector Architectures
• Basic idea:
  – Read sets of data elements into "vector registers"
  – Operate on those registers
  – Disperse the results back into memory
• Registers are controlled by the compiler
  – Register files act as compiler-controlled buffers
  – Used to hide memory latency
  – Leverage memory bandwidth
• Vector loads/stores are deeply pipelined
  – Pay for memory latency once per vector load/store
• Regular loads/stores
  – Pay for memory latency for each vector element
Vector Supercomputers
• In the 70–80s: Supercomputer = Vector machine
• Definitions of a supercomputer:
  – Fastest machine in the world at a given task
  – A device to turn a compute-bound problem into an I/O-bound problem
  – CDC 6600 (Cray, 1964) is regarded as the first supercomputer
• Vector supercomputers (epitomized by the Cray-1, 1976)
  – Scalar unit + vector extensions
    • Vector registers, vector instructions
    • Vector loads/stores
    • Highly pipelined functional units
Cray-1 (1976)
[Block diagram: single-port memory with 16 banks of 64-bit words + 8-bit SECDED (80 MW/sec data load/store, 320 MW/sec instruction-buffer refill); 4 instruction buffers (64-bit x 16); scalar registers S0–S7 backed by 64 T registers; address registers A0–A7 backed by 64 B registers; eight 64-element vector registers V0–V7 with V-mask and V-length registers; pipelined functional units for FP add/multiply/reciprocal, integer add/logic/shift, population count, and address add/multiply. Memory bank cycle: 50 ns; processor cycle: 12.5 ns (80 MHz).]
Vector Programming Model
[Figure: scalar registers r0–r15; vector registers v0–v15, each holding elements [0] … [VLRMAX-1]; a vector length register VLR. Vector arithmetic (e.g., ADDV v3, v1, v2) applies the operation elementwise across positions [0] … [VLR-1]. Vector load/store (e.g., LV v1, r1, r2) moves a vector between memory and a vector register using base address r1 and stride r2.]
Example: VMIPS
• Loosely based on the Cray-1
• Vector registers
  – Each register holds a 64-element, 64 bits/element vector
  – Register file has 16 read ports and 8 write ports
• Vector functional units
  – Fully pipelined
  – Data and control hazards are detected
• Vector load-store unit
  – Fully pipelined
  – Words move between the vector registers and memory
  – One word per clock cycle after the initial latency
• Scalar registers
  – 32 general-purpose registers
  – 32 floating-point registers
VMIPS Instructions
• Operate on many elements concurrently
  – Allows use of slow but wide execution units
  – High performance, lower power
• Independence of elements within a vector instruction
  – Allows scaling of functional units without costly dependence checks
• Flexible
  – 64 64-bit, 128 32-bit, 256 16-bit, or 512 8-bit elements
  – Matches the needs of multimedia (8-bit) and scientific applications that require high precision
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add a vector to a scalar
• LV/SV: vector load and vector store from an address
• Vector code example: see the DAXPY code below
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• The Cray-1 ('76) was the first vector register machine

Example source code:
for (i = 0; i < N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector memory-memory code:
ADDV C, A, B
SUBV D, A, B

Vector register code:
LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
  – All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  – Must check dependencies on memory addresses
• VMMAs incur greater startup latency
  – Scalar code was faster on the CDC Star-100 for vectors < 100 elements
  – For the Cray-1, the vector/scalar breakeven point was around 2 elements
⇒ Apart from the CDC follow-ons (Cyber-205, ETA-10), all major vector machines since the Cray-1 have had vector register architectures
(we ignore vector memory-memory from now on)
Vector Instruction Set Advantages
• Compact
  – One short instruction encodes N operations
• Expressive: tells hardware that these N operations
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in the same pattern as the previous instruction
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
• Scalable
  – Can run the same object code on more parallel pipelines, or lanes
Vector Instructions Example
• Example: DAXPY (adds a scalar multiple of a double-precision vector to another double-precision vector)

L.D     F0, a       ; load scalar a
LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add
SV      Ry, V4      ; store result

• Requires 6 instructions only
• In MIPS code:
  – ADD waits for MUL, S.D waits for ADD
• In VMIPS:
  – Stall once for the first vector element; subsequent elements flow smoothly down the pipeline
  – Pipeline stall required once per vector instruction
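The scalar loop this vector code implements (DAXPY, the double-precision a*X plus Y kernel from the BLAS), in plain C:

void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* Y = a*X + Y */
}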
Challenges of Vector Instructions
• Start-up time
  – Application and architecture must support long vectors; otherwise, they will run out of instructions, requiring ILP
  – Determined by the latency of the vector functional unit
  – Assume the same as the Cray-1:
    • Floating-point add => 6 clock cycles
    • Floating-point multiply => 7 clock cycles
    • Floating-point divide => 20 clock cycles
    • Vector load => 12 clock cycles
Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Control of the deep pipeline is simple because elements in a vector are independent (=> no hazards)
[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2, one element pair per cycle.]
Vector Instruction Execution
[Figure: execution of ADDV C, A, B. With one pipelined functional unit, one element pair enters per cycle, producing C[0], C[1], C[2], … in sequence. With four-lane execution using four pipelined functional units, four element pairs enter per cycle: lane 0 produces C[0], C[4], C[8], …; lane 1 produces C[1], C[5], C[9], …; lane 2 produces C[2], C[6], C[10], …; lane 3 produces C[3], C[7], C[11], ….]
Vector Unit Structure
[Figure: a vector unit with four lanes. Each lane holds a slice of the vector register file and one pipeline of each functional unit, and connects to the memory subsystem. Lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; lane 2 holds elements 2, 6, 10, …; lane 3 holds elements 3, 7, 11, ….]
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
  – Allows for multiple hardware lanes
  – No communication between lanes
  – Little increase in control overhead
  – No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization

for (i = 0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: the scalar sequential code issues load, load, add, store per iteration (Iter. 1, Iter. 2, …); the vectorized code issues one vector load, vector load, vector add, vector store covering all iterations.]
• Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis
Vector Execution Time
• Execution time depends on three factors:
  – Length of operand vectors
  – Structural hazards
  – Data dependencies
• VMIPS functional units consume one element per clock cycle
  – Execution time is approximately the vector length
• Convoy
  – Set of vector instructions that could potentially execute together
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes
[Figure: load, multiply, and add units each work on successive vector instructions, overlapped in time; the machine completes 24 operations/cycle while issuing 1 short instruction/cycle.]
Convoy
• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
Vector Chaining
• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available

LV    V1
MULV  V3, V1, V2
ADDV  V5, V3, V4

[Figure: the load unit fills V1 from memory; its elements are chained into the multiplier (V1 * V2 -> V3), whose results are in turn chained into the adder (V3 + V4 -> V5).]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: Load/Mul/Add timelines with and without chaining; chaining overlaps the three units and shortens the total time.]
Chimes
• Chime: unit of time to execute one convoy
  – An m-convoy sequence executes in m chimes
  – For a vector length of n, this requires m x n clock cycles
• Example:

LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add two vectors
SV      Ry, V4      ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes; 2 FP ops per result; cycles per FLOP = 1.5
For 64-element vectors, requires 64 x 3 = 192 clock cycles
Vector Strip-mining
• Problem: vector registers have finite length
• Solution: break loops into pieces that fit into vector registers ("strip-mining")

for (i = 0; i < N; i++)
  C[i] = A[i] + B[i];

      ANDI   R1, N, 63     ; N mod 64
      MTC1   VLR, R1       ; do remainder first
loop: LV     V1, RA
      DSLL   R2, R1, 3     ; multiply by 8
      DADDU  RA, RA, R2    ; bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      ; subtract elements done
      LI     R1, 64
      MTC1   VLR, R1       ; reset to full length
      BGTZ   N, loop       ; any more to do?

[Figure: arrays A, B, C processed as a short remainder piece followed by full 64-element pieces.]
Vector Length Register
• Handles loops with lengths not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:

low = 0;
VL = (n % MVL);  /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j = j + 1) {      /* outer loop */
  for (i = low; i < (low + VL); i = i + 1)  /* runs for length VL */
    Y[i] = a * X[i] + Y[i];                 /* main operation */
  low = low + VL;  /* start of next vector */
  VL = MVL;        /* reset the length to the maximum vector length */
}
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilizes the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
Vector-Mask Control
[Figure: two implementations of masked execution of C[i] = A[i] + B[i] under mask M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1. Simple implementation: execute all N operations, but turn off result writeback according to the mask (write-disable on the write data port). Density-time implementation: scan the mask vector and only execute elements with non-zero masks.]
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in vector mode without support
• Consider:

for (i = 0; i < 64; i = i + 1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:

LV      V1, Rx      ; load vector X into V1
LV      V2, Ry      ; load vector Y
L.D     F0, #0      ; load FP zero into F0
SNEVS.D V1, F0      ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1, V1, V2  ; subtract under vector mask
SV      Rx, V1      ; store the result in X

• GFLOPS rate decreases (masked-off elements still take execution time)
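What the masked sequence computes, written as scalar C (a sketch for one vector of up to 64 elements):

void masked_subtract(int n, double *X, const double *Y) {  /* n <= 64 */
    int VM[64];                          /* vector-mask register analogue */
    for (int i = 0; i < n; i++)
        VM[i] = (X[i] != 0.0);           /* SNEVS.D: set the mask         */
    for (int i = 0; i < n; i++)
        if (VM[i]) X[i] = X[i] - Y[i];   /* SUBVV.D under the mask        */
}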
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters a packed vector back to positions 1, 4, 5, 7, leaving the B values elsewhere.]
• Used for density-time conditionals and also for general selection operations
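In scalar C, the two operations look like this (a sketch; the helper names are mine):

/* Compress: pack elements with mask[i] != 0 to the front of dst;
   returns the packed length (the population count of the mask). */
int compress(int n, const double *a, const int *mask, double *dst) {
    int len = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[len++] = a[i];
    return len;
}

/* Expand: the inverse; scatter the packed src back to the masked positions. */
void expand(int n, const double *src, const int *mask, double *dst) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
}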
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 x 6 = 192
  2. An SRAM access takes 15 / 2.167 = 6.92, i.e., 7 processor cycles
  3. It requires 192 x 7 = 1344 memory banks to sustain full memory bandwidth
• The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
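The three patterns as scalar C loops (function and array names are illustrative; a vector machine issues each loop as a single vector load):

/* Unit stride: contiguous elements. */
double sum_unit(const double a[64]) {
    double s = 0.0;
    for (int i = 0; i < 64; i++) s += a[i];
    return s;
}

/* Non-unit (constant) stride, e.g., walking a column of a row-major matrix. */
double sum_strided(const double *a, int stride) {
    double s = 0.0;
    for (int i = 0; i < 64; i++) s += a[(long)i * stride];
    return s;
}

/* Indexed (gather): the index vector k picks arbitrary elements. */
double sum_gather(const double *a, const int k[64]) {
    double s = 0.0;
    for (int i = 0; i < 64; i++) s += a[k[i]];
    return s;
}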
Interleaved Memory Layout
[Figure: a vector processor connected to 8 unpipelined DRAM banks, selected by address mod 8 = 0 … 7.]
• Great for unit stride:
  – Contiguous elements sit in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4
Handling Multidimensional Arrays in Vector Architectures
• Consider:

for (i = 0; i < 100; i = i + 1)
  for (j = 0; j < 100; j = j + 1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k = k + 1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }

• Must vectorize the multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help non-unit-stride data
How to Get Full Bandwidth for Unit Stride?
• The memory system must sustain (# lanes x word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (can read the paper on how to do a prime number of banks efficiently)
Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j + 1)
  for (i = 0; i < 256; i = i + 1)
    x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if 2^N words per bank
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – LCM(stride, #banks) / stride < bank busy time
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for initialization
• What is the difference between a 64-element vector load with a stride of 1 and one with a stride of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride
     – Every access to memory (after the first one) collides with the previous access and has to wait for the 6-cycle bank busy time
     – The load takes 12 + 1 + 6 x 63 = 391 cycles, i.e., 6.1 cycles per element
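A small C model (my own sketch, not from the lecture) that reproduces both numbers: an element issues one cycle after the previous one unless its bank is still busy.

#include <stdio.h>

/* Cycles for an n-element strided vector load: `banks` memory banks
   (at most 64 here), `busy` bank busy time, `latency` startup latency. */
long strided_load_cycles(int n, int stride, int banks, int busy, int latency) {
    long ready[64] = {0};  /* cycle at which each bank is free again */
    long t = 0;            /* issue cycle of the current element     */
    for (int i = 0; i < n; i++) {
        int b = (int)((long)i * stride % banks);
        t = (t + 1 > ready[b]) ? t + 1 : ready[b];  /* stall if bank busy */
        ready[b] = t + busy;
    }
    return latency + t;
}

int main(void) {
    printf("stride  1: %ld cycles\n", strided_load_cycles(64,  1, 8, 6, 12)); /* 76  */
    printf("stride 32: %ld cycles\n", strided_load_cycles(64, 32, 8, 6, 12)); /* 391 */
    return 0;
}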
Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

for (i = 0; i < n; i = i + 1)
  A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Use LVI/SVI: load/store vector indexed (gather/scatter)
Scatter-Gather
• Consider:

for (i = 0; i < n; i = i + 1)
  A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

LV      Vk, Rk       ; load K
LVI     Va, (Ra+Vk)  ; load A[K[]] (gather)
LV      Vm, Rm       ; load M
LVI     Vc, (Rc+Vm)  ; load C[M[]] (gather)
ADDVV.D Va, Va, Vc   ; add them
SVI     (Ra+Vk), Va  ; store A[K[]] (scatter)

• A and C must have the same number of non-zero elements (the sizes of K and M)
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
  – More lanes allow a slower clock rate
  – Scalable if elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks (vectorization-level table not recovered)
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (the 1st since the 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., …
  – used in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optional signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
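As a concrete taste of these extensions, a small sketch using SSE intrinsics (ordinary host C; the function name is illustrative):

#include <xmmintrin.h>  /* SSE */

/* One 128-bit SSE add performs four single-precision additions. */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);           /* load 4 floats (unaligned) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));  /* c[0..3] = a[0..3] + b[0..3] */
}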
SIMD Implementations: IBM
• VMX (1996–1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPUs, 4 32b FPUs
  – Integrates the FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction means 64 memory accesses, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DAXPY

      L.D    F0, a        ; load scalar a
      MOV    F1, F0       ; copy a into F1 for SIMD MUL
      MOV    F2, F0       ; copy a into F2 for SIMD MUL
      MOV    F3, F0       ; copy a into F3 for SIMD MUL
      DADDIU R4, Rx, #512 ; last address to load
Loop: L.4D   F4, 0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4, F4, F0   ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
      L.4D   F8, 0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8, F8, F4   ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
      S.4D   0[Ry], F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx, Rx, #32  ; increment index to X
      DADDIU Ry, Ry, #32  ; increment index to Y
      DSUBU  R20, R4, Rx  ; compute bound
      BNEZ   R20, Loop    ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses to contiguous data are efficient
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write an entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128–256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)
[Figure: roofline plots for sample machines (not recovered).]
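A small C sketch of the model (the machine numbers are made-up examples, not measurements):

#include <stdio.h>

/* Roofline: attainable throughput is capped by either the compute peak
   or the memory ceiling (peak bandwidth x arithmetic intensity). */
double attainable_gflops(double peak_gflops, double peak_bw_gbs, double ai) {
    double mem_ceiling = peak_bw_gbs * ai;
    return mem_ceiling < peak_gflops ? mem_ceiling : peak_gflops;
}

int main(void) {
    /* Assumed machine: 16 GFLOPs/sec peak FP, 20 GB/sec memory bandwidth. */
    for (double ai = 0.25; ai <= 4.0; ai *= 2.0)
        printf("AI %.2f FLOPs/byte -> %.1f GFLOPs/sec\n",
               ai, attainable_gflops(16.0, 20.0, ai));
    return 0;
}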
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • The CPU is the host, the GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – The programming model is "Single Instruction Multiple Thread" (SIMT)
Programming Model
• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – A minimalist set of abstractions for expressing parallelism
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors
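The grid sizing above is just ceiling division; a one-line sketch (function name is mine):

// Blocks needed for n elements at t threads/block: 8192 and 512 give 16.
int grid_size(int n, int t) { return (n + t - 1) / t; }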
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: memory hierarchy of the GTX570. Global memory: 1280 MB; L2 cache: 640 KB. Each of the 15 SMs (SM 0 … SM 14) has 32 cores, 32,768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, and 8 KB constant cache, with up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
[Figure: GPU matrix multiplication illustration (not recovered).]
Matrix Multiplication
• For a 4096x4096 matrix multiplication:
  – Matrix C requires the calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
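A minimal CUDA sketch of the one-thread-per-cell idea described above (kernel name and launch geometry are illustrative; the shared-memory tiling the last bullet mentions is omitted for brevity):

#define N 4096

// Each thread computes one cell of C = A * B (row-major NxN floats).
__global__ void matmul(const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Launch with a 2D grid, one thread per cell:
//   dim3 threads(16, 16);
//   dim3 blocks(N / 16, N / 16);
//   matmul<<<blocks, threads>>>(dA, dB, dC);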
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread-within-block identifier
  – blockDim: threads per block
CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}
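For completeness, a hedged host-side sketch (not from the slides) of how the kernel above is typically wrapped: allocate device buffers, copy operands over, launch, copy the result back.

#include <cuda_runtime.h>

// Hypothetical wrapper around the daxpy kernel above; error checking omitted.
void daxpy_on_gpu(int n, double a, const double *x, double *y) {
    double *dx, *dy;
    cudaMalloc((void **)&dx, n * sizeof(double));  // device copies of X and Y
    cudaMalloc((void **)&dy, n * sizeof(double));
    cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, n * sizeof(double), cudaMemcpyHostToDevice);
    daxpy<<<(n + 255) / 256, 256>>>(n, a, dx, dy); // one thread per element
    cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
}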
NVIDIA Instruction Set Architecture
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:

shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – A branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – …and when paths converge
    • Acts as a barrier
    • Pops the stack
• A per-thread-lane 1-bit predicate register, specified by the programmer
Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
@!P1,   bra ELSE1, *Push            ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2 ; Difference in RD0
        st.global.f64 [X+R8], RD0   ; X[i] = RD0
@P1,    bra ENDIF1, *Comp           ; Complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0   ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; Pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU memory
  – The host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Processor
[Figure: block diagram of a Fermi streaming multiprocessor (not recovered).]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

for (i = 999; i >= 0; i = i - 1)
  x[i] = x[i] + s;

• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:

for (i = 0; i < 100; i = i + 1) {
  A[i+1] = A[i] + C[i];   /* S1 */
  B[i+1] = B[i] + A[i+1]; /* S2 */
}

• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – No loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:

for (i = 0; i < 100; i = i + 1) {
  A[i] = A[i] + B[i];   /* S1 */
  B[i+1] = C[i] + D[i]; /* S2 */
}

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:

A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
  B[i+1] = C[i] + D[i];     /* S2 */
  A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism

for (i = 0; i < 100; i = i + 1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i = 1; i < 100; i = i + 1)
  Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependences: Example
• Assume:
  – We load an array element with index c*i + d and store to index a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k within the loop bounds: m <= j <= n, m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
Example

for (i = 0; i < 100; i = i + 1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2; |d - b| = 3
  3. Since 2 does not divide 3, no dependence is possible
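A small C sketch of the test as stated (the helper names are mine):

#include <stdlib.h>

/* Greatest common divisor (Euclid's algorithm). */
int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* GCD test for a store to X[a*i+b] and a load from X[c*i+d]:
   returns 0 when GCD(c,a) does not divide (d-b), i.e., no dependence
   is possible; 1 means a dependence cannot be ruled out. */
int gcd_test(int a, int b, int c, int d) {
    return abs(d - b) % gcd(abs(a), abs(c)) == 0;
}

/* Slide example, X[2i+3] = X[2i] * 5.0: gcd_test(2, 3, 2, 0)
   checks 3 % 2 == 0, which is false, so no dependence is possible. */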
Remarks
• The GCD test is sufficient but not necessary: failing it proves independence, while passing it does not prove a dependence
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences
• Example 2:

for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c; /* S1 */
  X[i] = X[i] + c; /* S2 */
  Z[i] = Y[i] + c; /* S3 */
  Y[i] = c - Y[i]; /* S4 */
}

• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:

for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}

• After:

for (i = 0; i < 100; i = i + 1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product

for (i = 9999; i >= 0; i = i - 1)
  sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

for (i = 9999; i >= 0; i = i - 1)
  sum[i] = x[i] * y[i];  /* scalar expansion: scalar ==> vector; parallel */

for (i = 9999; i >= 0; i = i - 1)
  finalsum = finalsum + sum[i];  /* reduction: sums the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:

for (i = 999; i >= 0; i = i - 1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];

  – Assume p ranges from 0 to 9
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Introduction
bull SIMD architectures can exploit significant data‐level parallelism forndash matrix‐oriented scientific computingndash media‐oriented image and sound processors
bull SIMD is more energy efficient than MIMDndash Only needs to fetch one instruction per data operationndash Makes SIMD attractive for personal mobile devices
bull SIMD allows programmer to continue to think sequentially
CA-Lec8 cwliutwinseenctuedutw 2
Introduction
SIMD Variations
bull Vector architecturesbull SIMD extensions
ndash MMX multimedia extensions (1996)ndash SSE streaming SIMD extensionsndash AVX advanced vector extensions
bull Graphics Processor Units (GPUs)ndash Considered as SIMD accelerators
CA-Lec8 cwliutwinseenctuedutw
Introduction
3
SIMD vs MIMD
bull For x86 processorsbull Expect two additional cores per chip per year
bull SIMD width to double every four years
bull Potential speedup from SIMD to be twice that from MIMD
CA-Lec8 cwliutwinseenctuedutw 4
Vector Architecturesbull Basic idea
ndash Read sets of data elements into ldquovector registersrdquondash Operate on those registersndash Disperse the results back into memory
bull Registers are controlled by compilerndash Register files act as compiler controlled buffersndash Used to hide memory latencyndash Leverage memory bandwidth
bull Vector loadsstores deeply pipelinedndash Pay for memory latency once per vector loadstore
bull Regular loadsstoresndash Pay for memory latency for each vector element
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
5
Vector Supercomputers
bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer
ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound
problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer
bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions
bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units
CA-Lec8 cwliutwinseenctuedutw 6
Cray‐1 (1976)
Single PortMemory
16 banks of 64-bit words+ 8-bit SECDED
80MWsec data loadstore
320MWsec instructionbuffer refill
4 Instruction Buffers
64-bitx16 NIP
LIP
CIP
(A0)
( (Ah) + j k m )
64T Regs
(A0)
( (Ah) + j k m )
64 B Regs
S0S1S2S3S4S5S6S7
A0A1A2A3A4A5A6A7
Si
Tjk
Ai
Bjk
FP Add
FP Mul
FP Recip
Int Add
Int Logic
Int Shift
Pop Cnt
Sj
Si
Sk
Addr Add
Addr Mul
Aj
Ai
Ak
memory bank cycle 50 ns processor cycle 125 ns (80MHz)
V0V1V2V3V4V5V6V7
Vk
Vj
Vi V Mask
V Length64 Element Vector Registers
CA-Lec8 cwliutwinseenctuedutw 7
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic
ADDV v3 v1 v2 v3
v2v1
Scalar Registers
r0
r15Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLRVector Length Register
v1Vector Load and Store
LV v1 r1 r2
Base r1 Stride r2 Memory
Vector Register
Vector Programming Model
CA-Lec8 cwliutwinseenctuedutw 8
Example VMIPS
CA-Lec8 cwliutwinseenctuedutw 9
bull Loosely based on Cray‐1bull Vector registers
ndash Each register holds a 64‐element 64 bitselement vector
ndash Register file has 16 read ports and 8 write ports
bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected
bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency
bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers
VMIPS Instructionsbull Operate on many elements concurrently
bull Allows use of slow but wide execution unitsbull High performance lower power
bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks
bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision
CA-Lec8 cwliutwinseenctuedutw 10
Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
11
Vector Memory‐Memory vsVector Register Machines
bull Vector memory‐memory instructions hold all vector operands in main memory
bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines
bull Cray‐1 (rsquo76) was first vector register machine
for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]
Example Source Code ADDV C A BSUBV D A B
Vector Memory-Memory Code
LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D
Vector Register Code
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory
bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses
bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100
elementsndash For Cray‐1 vectorscalar breakeven point was around 2
elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantagesbull Compact
ndash One short instruction encodes N operationsbull Expressive
ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous
instructionndash N operations access a contiguous block of memory (unit‐stride
loadstore)ndash N operations access memory in a known pattern (stridden loadstore)
bull Scalablendash Can run same object code on more parallel pipelines or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Examplebull Example DAXPY
LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result
bull In MIPS Codebull ADD waits for MUL SD waits for ADD
bull In VMIPSbull Stall once for the first vector element subsequent elements will flow
smoothly down the pipeline bull Pipeline stall required once per vector instruction
CA-Lec8 cwliutwinseenctuedutw 15
adds a scalar multiple of a double precision vector to another double precision vector
Requires 6 instructions only
Challenges of Vector Instructionsbull Start up time
ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP
ndash Latency of vector functional unitndash Assume the same as Cray‐1
bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
16
V1 V2 V3
V3 lt- v1 v2
Six stage multiply pipeline
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution (also coded below):
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. An SRAM access takes 15 / 2.167 = 6.92 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
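The same arithmetic as a throwaway C helper (parameter names are ours):

    #include <math.h>
    /* banks needed = (refs/cycle) x (bank busy time in processor cycles) */
    int banks_needed(int processors, int refs_per_proc,
                     double sram_ns, double cycle_ns) {
        int refs = processors * refs_per_proc;     /* 32 x 6   = 192  */
        int busy = (int)ceil(sram_ns / cycle_ns);  /* 15/2.167 -> 7   */
        return refs * busy;                        /* 192 x 7  = 1344 */
    }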
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for vector operation is latency of single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for strides of 2, 4
[Figure: vector processor feeding 8 unpipelined DRAM banks, bank selected by Addr Mod 8 = 0 … 7]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit stride data
  – Blocking techniques help non-unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good to support more strides at full bandwidth
  – Can read a paper on how to do a prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks (a C sketch of these mappings follows the code below)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
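A C sketch of the two bank mappings (helper names are ours; real hardware would avoid a general divide for the prime modulus as well):

    /* Straightforward mapping: mod for the bank, divide within the bank. */
    int  bank_number (long addr, int banks) { return (int)(addr % banks); }
    long addr_in_bank(long addr, long words_per_bank) {
        return addr / words_per_bank;
    }

    /* Chinese-remainder trick with a prime bank count and 2^N words per
       bank: "address mod words_per_bank" is a cheap bit mask, no divide. */
    long addr_in_bank_crt(long addr, long words_per_bank) {
        return addr & (words_per_bank - 1);    /* words_per_bank = 2^N */
    }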
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the first access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?

• Solution (modeled in C below):
  1. Since 8 > 6, for a stride of 1 the load will take 12+64=76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – The load will take 12+1+6×63=391 cycles, i.e. 6.1 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
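A back-of-envelope C model of this example (our own simplification: one new access per cycle, and any access that revisits a bank sooner than the busy time stalls for the full busy time):

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

    /* Cycles for an n-element vector load with the given stride. */
    long load_cycles(int n, int stride, int banks, int busy, int latency) {
        int revisit = banks / gcd(stride, banks);  /* accesses between  */
                                                   /* same-bank hits    */
        int per_elem = (revisit >= busy) ? 1 : busy;
        return latency + 1 + (long)per_elem * (n - 1);
    }
    /* load_cycles(64,  1, 8, 6, 12) ->  76 cycles (1.2 per element) */
    /* load_cycles(64, 32, 8, 6, 12) -> 391 cycles (6.1 per element) */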
Handling Sparse Matrices in Vector architecture
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
    LV       Vk, Rk         ;load K
    LVI      Va, (Ra+Vk)    ;load A[K[]]
    LV       Vm, Rm         ;load M
    LVI      Vc, (Rc+Vm)    ;load C[M[]]
    ADDVV.D  Va, Va, Vc     ;add them
    SVI      (Ra+Vk), Va    ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to compilers
• Cray Y-MP benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
  – use in drivers or added to library routines; no compiler
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow (see the sketch below)
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
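The "saturate" option is just a clamp instead of the usual wraparound. In C (the helper name is ours; the corresponding MMX instruction is PADDUSB):

    /* Saturating unsigned 8-bit add. */
    unsigned char add_usat8(unsigned char a, unsigned char b) {
        unsigned int s = (unsigned int)a + (unsigned int)b;
        return (unsigned char)(s > 255 ? 255 : s);   /* clamp, don't wrap */
    }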
SIMD Implementations IA32AMD64
• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DAXPY (a C equivalent follows the listing)

    L.D     F0,a          ;load scalar a
    MOV     F1,F0         ;copy a into F1 for SIMD MUL
    MOV     F2,F0         ;copy a into F2 for SIMD MUL
    MOV     F3,F0         ;copy a into F3 for SIMD MUL
    DADDIU  R4,Rx,#512    ;last address to load
Loop:
    L.4D    F4,0[Rx]      ;load X[i], X[i+1], X[i+2], X[i+3]
    MUL.4D  F4,F4,F0      ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
    L.4D    F8,0[Ry]      ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
    ADD.4D  F8,F8,F4      ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
    S.4D    0[Ry],F8      ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
    DADDIU  Rx,Rx,#32     ;increment index to X
    DADDIU  Ry,Ry,#32     ;increment index to Y
    DSUBU   R20,R4,Rx     ;compute bound
    BNEZ    R20,Loop      ;check if done
CA-Lec8 cwliutwinseenctuedutw 56
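In C, one trip through Loop above is a four-way unrolled DAXPY (a sketch; n is assumed to be a multiple of 4, just as the assembly assumes a whole number of 4-element groups):

    /* What one loop iteration computes: 4 doubles of DAXPY at once. */
    void daxpy4(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i += 4) {
            y[i]   = a * x[i]   + y[i];
            y[i+1] = a * x[i+1] + y[i+1];
            y[i+2] = a * x[i+2] + y[i+2];
            y[i+3] = a * x[i+3] + y[i+3];
        }
    }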
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec (see the bound below)
[Figure: roofline plots lost in extraction]
CA-Lec8 cwliutwinseenctuedutw 59
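The roof itself is the minimum of the two limits (a sketch; parameter names are ours):

    /* Attainable GFLOP/s = min(peak FP throughput,
                                peak memory BW x arithmetic intensity) */
    double attainable_gflops(double peak_gflops,
                             double peak_bw_gbytes, double flops_per_byte) {
        double mem_roof = peak_bw_gbytes * flops_per_byte;
        return mem_roof < peak_gflops ? mem_roof : peak_gflops;
    }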
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction, Multiple Thread"
CA-Lec8 cwliutwinseenctuedutw 61
Graphical Processing Units
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical Processing Units
Programming the GPU
• CUDA Programming Model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Graphical Processing Units
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical Processing Units
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical Processing Units
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical Processing Units
GTX570 GPU
[Figure: memory hierarchy: Global Memory 1280MB; L2 Cache 640KB; 15 SMs (SM 0 … SM 14), each with 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache; up to 1536 threads/SM]
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a kernel sketch follows)
CA-Lec8 cwliutwinseenctuedutw 72
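A CUDA sketch of the tiling scheme just described (our own illustration, not code from the lecture; TILE is an assumed tile size, N is assumed to be a multiple of TILE, and the launch would use TILE x TILE thread blocks):

    #define TILE 16
    __global__ void matmul(const float *A, const float *B, float *C, int N) {
        __shared__ float As[TILE][TILE];   /* sub-matrix of A buffered   */
        __shared__ float Bs[TILE][TILE];   /* in shared memory           */
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;                  /* one thread -> one C cell   */
        for (int t = 0; t < N / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();               /* whole tile loaded          */
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               /* done before next tile      */
        }
        C[row * N + col] = sum;
    }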
Programming the GPU
• Distinguishing execution place of functions:
  – __device__ or __global__ => GPU Device
    • Variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);
    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
    // DAXPY in CUDA
    __device__
    void daxpy(int n, double a, double *x, double *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }
CA-Lec8 cwliutwinseenctuedutw 74
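Note: the listing keeps the lecture's labels, but in current CUDA the kernel launched with <<<...>>> must be declared __global__; __device__ marks functions callable only from other device code.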
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (DAXPY):

    shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical Processing Units
76
Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else X[i] = Z[i];

    ld.global.f64   RD0, [X+R8]     ; RD0 = X[i]
    setp.neq.s32    P1, RD0, #0     ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push          ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
    ld.global.f64   RD2, [Y+R8]     ; RD2 = Y[i]
    sub.f64         RD0, RD0, RD2   ; Difference in RD0
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
    @P1, bra ENDIF1, *Comp          ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64   RD0, [Z+R8]     ; RD0 = Z[i]
    st.global.f64   [X+R8], RD0     ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop        ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical Processing Units
77
NVIDIA GPU Memory Structures
• Each SIMD Lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical Processing Units
78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical Processing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical Processing Units
80
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires that there be no loop-carried dependence
• Example 1:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and Enhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
• Example 2:
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
• S1 and S2 use values computed by S1 and S2 in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – Not a loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and Enhancing Loop-Level Parallelism
82
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];     /* S2 */
        A[i+1] = A[i+1] + B[i+1]; /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies: Example
• Assume:
  – Load from an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test (a C sketch follows):
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
CA-Lec8 cwliutwinseenctuedutw 89
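The test is mechanical. A C sketch (helper names are ours):

    static int igcd(int a, int b) { return b ? igcd(b, a % b) : a; }

    /* Sufficient test: a dependence may exist only if GCD(c,a) | (d-b). */
    int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = igcd(a < 0 ? -a : a, c < 0 ? -c : c);
        int diff = d - b;
        if (diff < 0) diff = -diff;
        return g == 0 ? diff == 0 : (diff % g) == 0;
    }
    /* Example below: a=2, b=3, c=2, d=0 -> g=2, |d-b|=3 -> no dependence */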
Example
    for (i=0; i<100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependencies
• Example 2:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;  /* S1 */
        X[i] = X[i] + c;  /* S2 */
        Z[i] = Y[i] + c;  /* S3 */
        Y[i] = c - Y[i];  /* S4 */
    }
• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and Enhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example: a dot product
    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to...
    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];
    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion: scalar => vector; parallel.
The second loop is called a reduction: it sums up the elements of the vector; not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
Matrix Multiplication (figure slide)

Programming the GPU (figure slide)
Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C requires calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a tiled-multiply sketch follows).
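The following CUDA sketch shows the tiling scheme just described: one thread per element of C, with sub-matrices buffered in shared memory. The kernel name and tile width are illustrative assumptions, and the sketch assumes n is a multiple of TILE (true for 4096 and TILE = 16):

#define TILE 16
__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   /* sub-matrix of A buffered in shared memory */
    __shared__ float Bs[TILE][TILE];   /* sub-matrix of B buffered in shared memory */
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        /* each thread loads one element of each sub-matrix from global memory */
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               /* wait until the whole tile is loaded */
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* finish with this tile before reloading */
    }
    C[row * n + col] = sum;            /* one thread writes one element of C */
}
/* launch: dim3 grid(n/TILE, n/TILE), block(TILE, TILE); matmul<<<grid, block>>>(A, B, C, n); */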
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared in them are allocated to the GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within its block
  – blockDim: number of threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (the launched kernel must be declared __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ... and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64   RD0, [X+R8]       ; RD0 = X[i]
setp.neq.s32    P1, RD0, #0       ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64   RD2, [Y+R8]       ; RD2 = Y[i]
sub.f64         RD0, RD0, RD2     ; Difference in RD0
st.global.f64   [X+R8], RD0       ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by the SIMD processors is GPU Memory
  – The host can read and write GPU memory
(A small CUDA illustration of the three levels follows.)
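A brief sketch of how the three memory levels surface in CUDA source; the kernel name and sizes are illustrative assumptions (launch with a single block of 256 threads):

__global__ void levels(float *gbuf)  /* gbuf points into GPU memory, shared by all SIMD processors */
{
    __shared__ float tile[256];      /* local memory: shared by the lanes/threads of this block */
    float priv = gbuf[threadIdx.x];  /* private: per thread, held in registers or private memory */
    tile[threadIdx.x] = priv * 2.0f;
    __syncthreads();                 /* make the whole tile visible to the block */
    gbuf[threadIdx.x] = tile[255 - threadIdx.x];
}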
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Processor (figure slide)
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence (a vector-code sketch follows)
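Because Example 1 has no loop-carried dependence, it vectorizes directly. A VMIPS-style sketch in the notation used earlier in this lecture, with register choices illustrative and strip-mining over MVL-sized pieces omitted:

LD      F0, s        ; load scalar s
LV      V1, Rx       ; load vector X
ADDVS.D V2, V1, F0   ; add s to each element
SV      Rx, V2       ; store the result back to X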
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by that same statement in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are to the same location, and there is no intervening memory access to that location
• A more complex analysis (loop-carried dependence analysis plus data dependence analysis) can be applied in the same basic block to optimize (a sketch follows)
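A minimal sketch of that optimization: keep the computed value in a temporary so the second reference becomes a register reuse instead of a reload (the temporary t and the double element type are illustrative assumptions):

for (i=0; i<100; i=i+1) {
    double t = B[i] + C[i];   /* compute once */
    A[i] = t;                 /* store to A[i] */
    D[i] = t * E[i];          /* reuse the value; no reload of A[i] */
}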
Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a * i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding dependencies: Example
• Assume:
  – A load from an array element with index c * i + d, and a store to index a * i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. there exist j, k such that m <= j <= n and m <= k <= n
  2. a * j + b = c * k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution (a small checker sketch follows):
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2; |b - d| = 3
  3. Since 2 does not divide 3, no dependence is possible
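For completeness, a small C sketch of the GCD test as applied above; the function names are illustrative, and the test is conservative (it may report a possible dependence where none exists):

#include <stdio.h>

static int gcd(int x, int y)            /* Euclid's algorithm */
{
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* store index a*i+b, load index c*i+d: a loop-carried dependence
   is possible only if GCD(c, a) divides (d - b) */
static int gcd_test(int a, int b, int c, int d)
{
    return (d - b) % gcd(c, a) == 0;
}

int main(void)
{
    /* X[2*i+3] = X[2*i] * 5.0  =>  a = 2, b = 3, c = 2, d = 0 */
    printf(gcd_test(2, 3, 2, 0) ? "dependence possible\n" : "no dependence\n");
    return 0;
}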
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (reports a possible dependence) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 and S1->S4 (on Y[i]), but not loop-carried
• Antidependences: S1->S2 (on X[i]), S3->S4 (on Y[i])
• Output dependence: S1->S4 (on Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After (Y renamed to T, X renamed to X1, eliminating the false dependences):
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];            /* scalar expansion: scalar ==> vector; parallel */
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];    /* a reduction: sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Vector Architecturesbull Basic idea
ndash Read sets of data elements into ldquovector registersrdquondash Operate on those registersndash Disperse the results back into memory
bull Registers are controlled by compilerndash Register files act as compiler controlled buffersndash Used to hide memory latencyndash Leverage memory bandwidth
bull Vector loadsstores deeply pipelinedndash Pay for memory latency once per vector loadstore
bull Regular loadsstoresndash Pay for memory latency for each vector element
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
5
Vector Supercomputers
bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer
ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound
problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer
bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions
bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units
CA-Lec8 cwliutwinseenctuedutw 6
Cray‐1 (1976)
Single PortMemory
16 banks of 64-bit words+ 8-bit SECDED
80MWsec data loadstore
320MWsec instructionbuffer refill
4 Instruction Buffers
64-bitx16 NIP
LIP
CIP
(A0)
( (Ah) + j k m )
64T Regs
(A0)
( (Ah) + j k m )
64 B Regs
S0S1S2S3S4S5S6S7
A0A1A2A3A4A5A6A7
Si
Tjk
Ai
Bjk
FP Add
FP Mul
FP Recip
Int Add
Int Logic
Int Shift
Pop Cnt
Sj
Si
Sk
Addr Add
Addr Mul
Aj
Ai
Ak
memory bank cycle 50 ns processor cycle 125 ns (80MHz)
V0V1V2V3V4V5V6V7
Vk
Vj
Vi V Mask
V Length64 Element Vector Registers
CA-Lec8 cwliutwinseenctuedutw 7
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic
ADDV v3 v1 v2 v3
v2v1
Scalar Registers
r0
r15Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLRVector Length Register
v1Vector Load and Store
LV v1 r1 r2
Base r1 Stride r2 Memory
Vector Register
Vector Programming Model
CA-Lec8 cwliutwinseenctuedutw 8
Example VMIPS
CA-Lec8 cwliutwinseenctuedutw 9
bull Loosely based on Cray‐1bull Vector registers
ndash Each register holds a 64‐element 64 bitselement vector
ndash Register file has 16 read ports and 8 write ports
bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected
bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency
bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers
VMIPS Instructionsbull Operate on many elements concurrently
bull Allows use of slow but wide execution unitsbull High performance lower power
bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks
bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision
CA-Lec8 cwliutwinseenctuedutw 10
Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
11
Vector Memory‐Memory vsVector Register Machines
bull Vector memory‐memory instructions hold all vector operands in main memory
bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines
bull Cray‐1 (rsquo76) was first vector register machine
for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]
Example Source Code ADDV C A BSUBV D A B
Vector Memory-Memory Code
LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D
Vector Register Code
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory
bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses
bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100
elementsndash For Cray‐1 vectorscalar breakeven point was around 2
elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantages
• Compact
– One short instruction encodes N operations
• Expressive: tells hardware that these N operations
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as the previous instruction
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– Can run the same object code on more parallel pipelines, or lanes
Vector Instructions Example
• Example: DAXPY - adds a scalar multiple of a double-precision vector to another double-precision vector

L.D F0, a         ; load scalar a
LV V1, Rx         ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry         ; load vector Y
ADDVV.D V4, V2, V3 ; add
SV Ry, V4         ; store result

• Requires 6 instructions only
• In MIPS code: ADD waits for MUL, S.D waits for ADD
• In VMIPS: stall once for the first vector element; subsequent elements flow smoothly down the pipeline
– Pipeline stall required only once per vector instruction
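For reference, this is the C loop the VMIPS sequence above implements (64-element arrays X and Y and scalar a assumed in scope):

for (int i = 0; i < 64; i++)
    Y[i] = a * X[i] + Y[i];   /* Y = a*X + Y */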
Challenges of Vector Instructions
• Start-up time
– Application and architecture must support long vectors; otherwise, they will run out of instructions requiring ILP
– Latency of vector functional unit
– Assume the same as Cray-1:
• Floating-point add => 6 clock cycles
• Floating-point multiply => 7 clock cycles
• Floating-point divide => 20 clock cycles
• Vector load => 12 clock cycles
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six-stage multiply pipeline computing V3 <- V1 * V2, one element pair entering per cycle.]
Vector Instruction Execution
ADDV C, A, B
[Figure: execution using one pipelined functional unit - element pairs A[i], B[i] enter one per cycle and results C[0], C[1], C[2], ... emerge in order; versus four-lane execution using four pipelined functional units - lane 0 handles elements 0, 4, 8, ...; lane 1 handles 1, 5, 9, ...; lane 2 handles 2, 6, 10, ...; lane 3 handles 3, 7, 11, ....]

Vector Unit Structure
[Figure: vector registers partitioned across lanes; each lane holds its slice of the register file (elements 0, 4, 8, ... / 1, 5, 9, ... / 2, 6, 10, ... / 3, 7, 11, ...), a pipelined functional unit, and a port to the memory subsystem.]
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
– Allows for multiple hardware lanes
– No communication between lanes
– Little increase in control overhead
– No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code performs load, load, add, store for iteration 1, then again for iteration 2, and so on down the page; vectorized code issues one vector load, load, add, store, processing the elements of iterations 1, 2, ... in parallel over time.]
• Vectorization is a massive compile-time reordering of operation sequencing
• Requires extensive loop dependence analysis
Vector Execution Time
• Execution time depends on three factors
– Length of operand vectors
– Structural hazards
– Data dependencies
• VMIPS functional units consume one element per clock cycle
– Execution time is approximately the vector length
• Convoy
– Set of vector instructions that could potentially execute together
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– Example machine has 32 elements per vector register and 8 lanes
[Figure: load unit, multiply unit, and add unit each complete a 32-element vector in 4 cycles (8 lanes); successive load, mul, and add instructions overlap in time.]
• Complete 24 operations/cycle while issuing 1 short instruction/cycle
Convoy
• Convoy: set of vector instructions that could potentially execute together
– Must not contain structural hazards
– Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available

LV V1
MULV V3, V1, V2
ADDV V5, V3, V4

[Figure: the load unit fills V1 from memory; results chain from the load unit into the multiplier (V1 * V2 -> V3), and from the multiplier into the adder (V3 + V4 -> V5), without waiting for whole vectors to complete.]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: timeline comparing Load -> Mul -> Add without chaining (serialized) and with chaining (overlapped).]
Chimes
• Chime: unit of time to execute one convoy
– m convoys execute in m chimes
– For vector length of n, requires m × n clock cycles

Example
LV V1, Rx          ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry          ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV Ry, V4          ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
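The chime arithmetic above, written out as a small C sketch (numbers taken from this example):

int m = 3, n = 64;                                   /* convoys, vector length */
int cycles = m * n;                                  /* = 192 clock cycles */
double cycles_per_flop = (double)cycles / (2.0 * n); /* 2 FP ops/element -> 1.5 */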
Vector Strip-mining
• Problem: vector registers have finite length
• Solution: break loops into pieces that fit into vector registers - "strip-mining"

for (i=0; i<N; i++)
    C[i] = A[i] + B[i];

      ANDI R1, N, 63    ; N mod 64
      MTC1 VLR, R1      ; Do remainder
loop: LV V1, RA
      DSLL R2, R1, 3    ; Multiply by 8
      DADDU RA, RA, R2  ; Bump pointer
      LV V2, RB
      DADDU RB, RB, R2
      ADDV.D V3, V1, V2
      SV V3, RC
      DADDU RC, RC, R2
      DSUBU N, N, R1    ; Subtract elements
      LI R1, 64
      MTC1 VLR, R1      ; Reset full length
      BGTZ N, loop      ; Any more to do?

[Figure: A, B, C processed as an odd-size remainder piece followed by full 64-element pieces.]
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, with strip-mining:

low = 0;
VL = (n % MVL);                       /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {    /* outer loop */
    for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
        Y[i] = a * X[i] + Y[i];       /* main operation */
    low = low + VL;                   /* start of next vector */
    VL = MVL;                         /* reset the length to maximum vector length */
}
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA
Vector-Mask Control
[Figure: masked vector subtract with mask M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1. Simple implementation: execute all N operations, but turn off result writeback according to the mask (write disable on the write data port). Density-time implementation: scan the mask vector and execute only the elements with non-zero masks.]
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
– Invisible to SW
• Both GPUs and vector processors spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops otherwise cannot be run on a vector processor
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV V1, Rx          ; load vector X into V1
LV V2, Ry          ; load vector Y
L.D F0, #0         ; load FP zero into F0
SNEVS.D V1, F0     ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1, V1, V2 ; subtract under vector mask
SV Rx, V1          ; store the result in X
• GFLOPS rate decreases (masked-off elements still consume execution time)
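A scalar C sketch of what this masked sequence computes; the explicit vm variable stands in for VM(i) (illustration only, 64-element vectors assumed):

for (int i = 0; i < 64; i++) {
    int vm = (X[i] != 0.0);   /* SNEVS.D: set per-element mask bit */
    if (vm)                   /* SUBVV.D writes only where VM(i) = 1 */
        X[i] = X[i] - Y[i];
}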
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
– Population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M = 0,1,0,0,1,1,0,1 over A[0..7], compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to the masked positions, leaving the unmasked positions (B[0], B[2], B[3], B[6]) unchanged.]
• Used for density-time conditionals and also for general selection operations
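A minimal C sketch of the two operations, assuming 8-element vectors and an int mask array as in the figure:

int compress(const double *a, const int *m, double *dst) {
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (m[i]) dst[n++] = a[i];   /* pack masked elements contiguously */
    return n;                        /* = population count of the mask */
}

void expand(const double *src, const int *m, double *a) {
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (m[i]) a[i] = src[n++];   /* inverse: scatter back under mask */
}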
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
– Spread accesses across multiple banks:
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. Each SRAM access takes 15/2.167 ≈ 6.92, i.e. 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth!
• Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
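The same arithmetic as a tiny C check (values from this example):

#include <math.h>
#include <stdio.h>

int main(void) {
    double proc_ns = 2.167, sram_ns = 15.0;
    int refs_per_cycle = 32 * (4 + 2);              /* 192 references/cycle */
    int busy_cycles = (int)ceil(sram_ns / proc_ns); /* ceil(6.92) = 7 cycles */
    printf("banks needed = %d\n", refs_per_cycle * busy_cycles); /* 1344 */
    return 0;
}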
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing
– Unit stride
• Contiguous block of information in memory
• Fastest: always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize the memory system for all possible strides
• Prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize
Interleaved Memory Layout
[Figure: vector processor fanning out to eight unpipelined DRAM banks, selected by Addr Mod 8 = 0 ... 7.]
• Great for unit stride:
– Contiguous elements sit in different DRAMs
– Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
– The above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into the register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– Blocking techniques help non-unit stride data
How to Get Full Bandwidth for Unit Stride?
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– Can read a paper on how to do a prime number of banks efficiently
Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number: easy if 2^N words per bank
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for initialization
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
2. Since 32 is a multiple of 8, this is the worst possible stride
– Every access to memory (after the first one) will collide with the previous access and will have to wait the 6-cycle bank busy time
– The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element
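A small C model of this calculation; a sketch under the example's assumptions (a stride is treated as conflict-free when it touches at least bank-busy-time distinct banks, and as fully serialized otherwise):

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* Cycles to load n elements, given the stride, number of banks,
   bank busy time, and startup latency. */
int load_cycles(int n, int stride, int banks, int busy, int startup) {
    int touched = banks / gcd(stride, banks); /* distinct banks hit */
    if (touched >= busy)
        return startup + n;                   /* one element per cycle */
    return startup + 1 + busy * (n - 1);      /* every access waits */
}

int main(void) {
    printf("stride 1:  %d cycles\n", load_cycles(64, 1, 8, 6, 12));  /* 76 */
    printf("stride 32: %d cycles\n", load_cycles(64, 32, 8, 6, 12)); /* 391 */
    return 0;
}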
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed, or gathered
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV Vk, Rk          ; load K
LVI Va, (Ra+Vk)    ; load A[K[]]
LV Vm, Rm          ; load M
LVI Vc, (Rc+Vm)    ; load C[M[]]
ADDVV.D Va, Va, Vc ; add them
SVI (Ra+Vk), Va    ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
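What the LVI/SVI gather and scatter do, written out in scalar C (a sketch; double arrays A and C and int index vectors K and M assumed, 64 elements):

double Va[64], Vc[64];
for (int i = 0; i < 64; i++) Va[i] = A[K[i]];   /* LVI: gather A[K[i]] */
for (int i = 0; i < 64; i++) Vc[i] = C[M[i]];   /* LVI: gather C[M[i]] */
for (int i = 0; i < 64; i++) Va[i] += Vc[i];    /* ADDVV.D */
for (int i = 0; i < 64; i++) A[K[i]] = Va[i];   /* SVI: scatter result */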
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is a dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (AltiVec, SSE2)
• Limited instruction set
– No vector length control
– No load/store stride or scatter/gather
– Unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length
– Requires superscalar dispatch to keep multiply/add/load units busy
– Loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– Similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
– Reuse the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– Use in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– Optionally signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– Sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
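As a concrete illustration (not from the slides): one 128-bit SSE add computes four single-precision sums; _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps are the standard <xmmintrin.h> intrinsics:

#include <xmmintrin.h>

/* c[0..3] = a[0..3] + b[0..3] with a single SSE add. */
void add4(float *c, const float *a, const float *b) {
    __m128 va = _mm_loadu_ps(a);          /* load 4 floats (unaligned ok) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb)); /* one 128-bit add, 4 lanes */
}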
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers, aligned in memory
– Vector: a single instruction means 64 memory accesses; a page fault in the middle of the vector is likely!
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY

      L.D F0, a          ; load scalar a
      MOV F1, F0         ; copy a into F1 for SIMD MUL
      MOV F2, F0         ; copy a into F2 for SIMD MUL
      MOV F3, F0         ; copy a into F3 for SIMD MUL
      DADDIU R4, Rx, #512 ; last address to load
Loop: L.4D F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4, F4, F0  ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8, F8, F4  ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D 0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx, Rx, #32 ; increment index to X
      DADDIU Ry, Ry, #32 ; increment index to Y
      DSUBU R20, R4, Rx  ; compute bound
      BNEZ R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read

Examples
• Attainable GFLOPs/sec = min(peak memory BW × arithmetic intensity, peak floating-point performance)
[Figure: roofline plots for sample machines.]
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread" (SIMT)

Programming Model
• CUDA's design goals
– Extend a standard sequential programming language, specifically C/C++
• Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– Minimalist set of abstractions for expressing parallelism
• Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
– Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– A block is analogous to a strip-mined vector loop with vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: global memory 1280 MB; L2 cache 640 KB; 15 SMs (SM 0 ... SM 14), each with 32 cores, 32,768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, and 8 KB constant cache; up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
• Memory access optimization, latency hiding
Matrix Multiplication
[Figure]

Programming the GPU
[Figure]
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
Programming the GPU
• Distinguishing execution place of functions:
– __device__ or __global__ => GPU device; variables declared are allocated to the GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CUDA Program Example
• Invoke DAXPY:
daxpy(n, 2.0, x, y);
• DAXPY in C:
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
• Invoke DAXPY with 256 threads per thread block:
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
• DAXPY in CUDA:
__device__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Architecture
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one DAXPY iteration):
shl.s32 R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0  ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ... and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
        setp.neq.s32 P1, RD0, #0   ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push     ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
        sub.f64 RD0, RD0, RD2      ; Difference in RD0
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
        @P1, bra ENDIF1, *Comp     ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Processor
[Figure]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
– Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];     /* S1 */
    B[i+1] = C[i] + D[i];   /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];       /* S2 */
    A[i+1] = A[i+1] + B[i+1];   /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependences: Example
• Assume:
– Load an array element with index c × i + d, and store to the element with index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m <= j <= n and m <= k <= n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, d=0
2. GCD(a, c) = 2, |d - b| = 3
3. Since 2 does not divide 3, no dependence is possible
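A small C sketch of the GCD test applied to this example (as the next slide notes, the test is only a necessary condition for dependence):

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* Store index a*i+b, load index c*i+d: a loop-carried dependence
   is impossible unless gcd(c, a) divides (d - b). */
int gcd_test_may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(c, a) == 0;
}

int main(void) {
    /* X[2i+3] = X[2i] * 5.0 : a=2, b=3, c=2, d=0 */
    puts(gcd_test_may_depend(2, 3, 2, 0) ? "maybe dependent"
                                         : "no dependence possible");
    return 0;
}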
Remarks
• The GCD test is sufficient to rule out a dependence, but not necessary for one to exist
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
(this is called scalar expansion: scalar ===> vector; parallel)
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
(this is called a reduction: sums up the elements of the vector; not parallel)
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
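To finish the reduction, the ten partial sums still have to be combined; a one-line C sketch using the finalsum array from the slide:

/* Combine the per-processor partial sums into one scalar result. */
double total = 0.0;
for (int p = 0; p < 10; p++)
    total += finalsum[p];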
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading if fine-grained multithreading is based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Vector Supercomputers
bull In 70‐80s Supercomputer Vector machinebull Definition of supercomputer
ndash Fastest machine in the world at given taskndash A device to turn a compute‐bound problem into an IO‐bound
problemndash CDC6600 (Cray 1964) is regarded as the first supercomputer
bull Vector supercomputers (epitomized by Cray‐1 1976)ndash Scalar unit + vector extensions
bull Vector registers vector instructionsbull Vector loadsstoresbull Highly pipelined functional units
CA-Lec8 cwliutwinseenctuedutw 6
Cray‐1 (1976)
Single PortMemory
16 banks of 64-bit words+ 8-bit SECDED
80MWsec data loadstore
320MWsec instructionbuffer refill
4 Instruction Buffers
64-bitx16 NIP
LIP
CIP
(A0)
( (Ah) + j k m )
64T Regs
(A0)
( (Ah) + j k m )
64 B Regs
S0S1S2S3S4S5S6S7
A0A1A2A3A4A5A6A7
Si
Tjk
Ai
Bjk
FP Add
FP Mul
FP Recip
Int Add
Int Logic
Int Shift
Pop Cnt
Sj
Si
Sk
Addr Add
Addr Mul
Aj
Ai
Ak
memory bank cycle 50 ns processor cycle 125 ns (80MHz)
V0V1V2V3V4V5V6V7
Vk
Vj
Vi V Mask
V Length64 Element Vector Registers
CA-Lec8 cwliutwinseenctuedutw 7
+ + + + + +
[0] [1] [VLR-1]
Vector Arithmetic
ADDV v3 v1 v2 v3
v2v1
Scalar Registers
r0
r15Vector Registers
v0
v15
[0] [1] [2] [VLRMAX-1]
VLRVector Length Register
v1Vector Load and Store
LV v1 r1 r2
Base r1 Stride r2 Memory
Vector Register
Vector Programming Model
CA-Lec8 cwliutwinseenctuedutw 8
Example VMIPS
CA-Lec8 cwliutwinseenctuedutw 9
bull Loosely based on Cray‐1bull Vector registers
ndash Each register holds a 64‐element 64 bitselement vector
ndash Register file has 16 read ports and 8 write ports
bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected
bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency
bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers
VMIPS Instructionsbull Operate on many elements concurrently
bull Allows use of slow but wide execution unitsbull High performance lower power
bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks
bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision
CA-Lec8 cwliutwinseenctuedutw 10
Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
11
Vector Memory‐Memory vsVector Register Machines
bull Vector memory‐memory instructions hold all vector operands in main memory
bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines
bull Cray‐1 (rsquo76) was first vector register machine
for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]
Example Source Code ADDV C A BSUBV D A B
Vector Memory-Memory Code
LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D
Vector Register Code
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory
bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses
bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100
elementsndash For Cray‐1 vectorscalar breakeven point was around 2
elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantagesbull Compact
ndash One short instruction encodes N operationsbull Expressive
ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous
instructionndash N operations access a contiguous block of memory (unit‐stride
loadstore)ndash N operations access memory in a known pattern (stridden loadstore)
bull Scalablendash Can run same object code on more parallel pipelines or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Examplebull Example DAXPY
LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result
bull In MIPS Codebull ADD waits for MUL SD waits for ADD
bull In VMIPSbull Stall once for the first vector element subsequent elements will flow
smoothly down the pipeline bull Pipeline stall required once per vector instruction
CA-Lec8 cwliutwinseenctuedutw 15
adds a scalar multiple of a double precision vector to another double precision vector
Requires 6 instructions only
Challenges of Vector Instructionsbull Start up time
ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP
ndash Latency of vector functional unitndash Assume the same as Cray‐1
bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
16
V1 V2 V3
V3 lt- v1 v2
Six stage multiply pipeline
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– use in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optionally signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
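As a concrete taste of these extensions, the sketch below uses SSE2 intrinsics to add eight 16-bit integers per instruction, the style of operation the MMX/SSE slides describe. It is an illustrative sketch, not from the slides; it assumes n is a multiple of 8.

#include <emmintrin.h>   /* SSE2 intrinsics */

void add16(short *dst, const short *a, const short *b, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vs = _mm_add_epi16(va, vb);  /* wraps on overflow;       */
                                             /* _mm_adds_epi16 saturates */
        _mm_storeu_si128((__m128i *)(dst + i), vs);
    }
}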
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, or 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, or 4 32b integer ops, and 4 32b and 2 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
– Vector: a single instruction may make 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY
        LD      F0, a        ; load scalar a
        MOV     F1, F0       ; copy a into F1 for SIMD MUL
        MOV     F2, F0       ; copy a into F2 for SIMD MUL
        MOV     F3, F0       ; copy a into F3 for SIMD MUL
        DADDIU  R4, Rx, #512 ; last address to load
Loop:   L.4D    F4, 0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4, F4, F0   ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
        L.4D    F8, 0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8, F8, F4   ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
        S.4D    0[Ry], F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx, Rx, #32  ; increment index to X
        DADDIU  Ry, Ry, #32  ; increment index to Y
        DSUBU   R20, R4, Rx  ; compute bound
        BNEZ    R20, Loop    ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
(the slide's roofline plots for sample machines are not reproduced here)
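The Min() above is easy to evaluate directly. The sketch below prints the characteristic roofline shape; the peak numbers (100 GFLOP/s, 25 GB/s) are made-up placeholders, not figures from the slides.

#include <stdio.h>

static double roofline(double peak_gflops, double peak_gbps, double ai)
{
    double mem_roof = peak_gbps * ai;            /* memory-bound GFLOP/s */
    return mem_roof < peak_gflops ? mem_roof : peak_gflops;
}

int main(void)
{
    for (double ai = 0.25; ai <= 16.0; ai *= 2.0)
        printf("AI = %5.2f FLOP/byte -> %6.2f GFLOP/s\n",
               ai, roofline(100.0, 25.0, ai));
    return 0;
}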
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
– Single Instruction, Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip-mined vector loop with vector length of 32
– Block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
(block diagram) Global memory: 1280MB; shared L2 cache: 640KB; per SM (SM 0 ... SM 14): 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache; up to 1536 threads per SM
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively: memory access optimization, latency hiding
Matrix Multiplication
(slide figure not reproduced here)
Programming the GPU
(slide figure not reproduced here)
Matrix Multiplication
• For a 4096x4096 matrix multiplication
– Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and to take advantage of GPU bandwidth
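A minimal CUDA kernel in the one-thread-per-cell style described above might look like the sketch below. This is an illustrative assumption, not code from the slides: it omits the shared-memory buffering mentioned in the last bullet, and the name matmul, the fixed size N, and the 16x16 block shape are all made up.

#define N 4096

__global__ void matmul(const float *A, const float *B, float *C)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;                 /* this thread's single cell of C */
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
/* launch: dim3 block(16, 16); dim3 grid(N / 16, N / 16);
           matmul<<<grid, block>>>(A, B, C);               */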
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU (device); variables declared in them are allocated to the GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: number of threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<< >>> must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
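The slide shows the kernel but not how x and y reach the device. Below is a hedged sketch of the host-side setup that typically surrounds such a launch, using standard CUDA runtime calls; error checking is omitted and the d_x/d_y names are illustrative.

double *d_x, *d_y;                        /* device copies of x and y */
cudaMalloc(&d_x, n * sizeof(double));
cudaMalloc(&d_y, n * sizeof(double));
cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);
int nblocks = (n + 255) / 256;            /* enough blocks to cover n elements */
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_x); cudaFree(d_y);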
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

    ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
    setp.neq.s32  P1, RD0, #0      ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push         ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
    ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
    sub.f64       RD0, RD0, RD2    ; Difference in RD0
    st.global.f64 [X+R8], RD0      ; X[i] = RD0
    @P1, bra ENDIF1, *Comp         ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
    st.global.f64 [X+R8], RD0      ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
(block diagram not reproduced here)
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 uses A[i] computed by S1, and S2 uses B[i] computed by S2, in the previous iteration
– Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
– No loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis (loop-carried dependence analysis plus data dependence analysis) can be applied within the same basic block to optimize
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
– a load of an array element with index c × i + d, and a store to index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide |d-b|
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2, |b-d| = 3
3. Since 2 does not divide 3, no dependence is possible
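The test is mechanical enough to code directly. A small C sketch (illustrative only; it assumes the coefficients a and c are nonzero so the GCD is well defined):

#include <stdio.h>

static int gcd(int x, int y) { while (y) { int t = x % y; x = y; y = t; } return x; }

/* Store to a*i + b, load from c*i + d: a loop-carried dependence
   requires GCD(c, a) to divide |d - b|. Returns 1 if possible. */
static int dependence_possible(int a, int b, int c, int d)
{
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % g == 0;
}

int main(void)
{
    /* X[2i+3] = X[2i] * 5.0: a=2, b=3, c=2, d=0 -> prints 0 (ruled out) */
    printf("%d\n", dependence_possible(2, 3, 2, 0));
    return 0;
}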
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;    /* Y renamed to T */
    X1[i] = X[i] + c;   /* X renamed to X1 */
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to...
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
– This is called scalar expansion: scalar ===> vector; parallel
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
– This is called a reduction: it sums up the elements of the vector; not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: To sum up 1000 elements on each of ten processors
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
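Putting the pieces together, the strategy amounts to the C sketch below (illustrative; the outer p loop is what would run on the ten processors in parallel, with threading and synchronization omitted):

double reduce_10000(const double *sum)   /* sum[] holds 10,000 partial products */
{
    double finalsum[10] = {0.0};
    for (int p = 0; p < 10; p++)         /* each "processor" p works privately */
        for (int i = 999; i >= 0; i--)
            finalsum[p] += sum[i + 1000 * p];
    double total = 0.0;
    for (int p = 0; p < 10; p++)         /* final combine of ten partial sums */
        total += finalsum[p];
    return total;
}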
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried.
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i]).
• Output dependence: S1->S4 (Y[i]).
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After (T removes the output dependence, X1 the antidependence):
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion (scalar ==> vector): parallel.
The second loop is called a reduction (it sums up the elements of the vector): not parallel.
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures:
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];
  – Assume p ranges from 0 to 9.
CA-Lec8 cwliutwinseenctuedutw 95
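A C sketch of the full two-level reduction described above (an editor's illustration, assuming sum[] holds the 10,000 products from the scalar-expanded loop):

double reduce10(const double sum[10000]) {
    double finalsum[10] = {0.0};
    for (int p = 0; p < 10; p++)        /* each p could run on its own processor */
        for (int i = 999; i >= 0; i--)
            finalsum[p] = finalsum[p] + sum[i + 1000*p];
    double total = 0.0;
    for (int p = 0; p < 10; p++)        /* final, sequential reduction of 10 partial sums */
        total = total + finalsum[p];
    return total;
}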
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse-grained vs. fine-grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture:
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines.
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.
CA-Lec8 cwliutwinseenctuedutw 96
VMIPS Instructions
• Operate on many elements concurrently:
  – Allows use of slow but wide execution units: high performance at lower power.
• Independence of elements within a vector instruction:
  – Allows scaling of functional units without costly dependence checks.
• Flexible:
  – 64 64-bit, 128 32-bit, 256 16-bit, or 512 8-bit elements.
  – Matches the needs of multimedia (8-bit) and of scientific applications that require high precision.
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add a vector to a scalar
• LV/SV: vector load and vector store from an address
• Vector code example: the DAXPY sequence shown a few slides below.
CA-Lec8 cwliutwinseenctuedutw 11
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory.
• The first vector machines, the CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines.
• The Cray-1 ('76) was the first vector register machine.
Example source code:
for (i=0; i<N; i++) {
  C[i] = A[i] + B[i];
  D[i] = A[i] - B[i];
}

Vector memory-memory code:
ADDV C, A, B
SUBV D, A, B

Vector register code:
LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
  – All operands must be read in and out of memory.
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
  – Must check dependences on memory addresses.
• VMMAs incur greater startup latency:
  – Scalar code was faster on the CDC Star-100 for vectors < 100 elements.
  – For the Cray-1, the vector/scalar breakeven point was around 2 elements.
Apart from the CDC follow-ons (Cyber-205, ETA-10), all major vector machines since the Cray-1 have had vector register architectures.
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantagesbull Compact
ndash One short instruction encodes N operationsbull Expressive
– Tells hardware that these N operations are independent.
– N operations use the same functional unit.
– N operations access disjoint registers.
– N operations access registers in the same pattern as the previous instruction.
– N operations access a contiguous block of memory (unit-stride load/store).
– N operations access memory in a known pattern (strided load/store).
bull Scalablendash Can run same object code on more parallel pipelines or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Example
• Example: DAXPY
LD F0, a           ; load scalar a
LV V1, Rx          ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar mult
LV V3, Ry          ; load vector Y
ADDVV.D V4, V2, V3 ; add
SV Ry, V4          ; store result
• In MIPS code: the ADD waits for the MUL, and the SD waits for the ADD.
• In VMIPS: stall once for the first vector element; subsequent elements flow smoothly down the pipeline.
• A pipeline stall is required only once per vector instruction.
CA-Lec8 cwliutwinseenctuedutw 15
DAXPY adds a scalar multiple of a double precision vector to another double precision vector. VMIPS requires only 6 instructions.
Challenges of Vector Instructions
• Start-up time:
  – Application and architecture must support long vectors; otherwise, they will run out of instructions, requiring ILP.
  – Latency of the vector functional unit; assume the same as the Cray-1:
    • Floating-point add => 6 clock cycles
    • Floating-point multiply => 7 clock cycles
    • Floating-point divide => 20 clock cycles
    • Vector load => 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw 16
Figure: V3 <- V1 * V2 flowing through a six-stage multiply pipeline.
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV C, A, B
Figure: execution using one pipelined functional unit produces one result per cycle (C[0], C[1], C[2], ...); four-lane execution using four pipelined functional units produces four results per cycle (C[0]..C[3], then C[4]..C[7], ...).
Vector Unit Structure
Figure: the vector registers feed four lanes, each with its own functional-unit pipelines and memory-subsystem port; lane 0 handles elements 0, 4, 8, ..., lane 1 handles elements 1, 5, 9, ..., lane 2 handles elements 2, 6, 10, ..., and lane 3 handles elements 3, 7, 11, ...
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecture
• Beyond one element per clock cycle.
• Element n of vector register A is hardwired to element n of vector register B:
  – Allows for multiple hardware lanes.
  – No communication between lanes.
  – Little increase in control overhead.
  – No need to change machine code.
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

Scalar sequential code (figure): each iteration performs load, load, add, store; Iter. 1 completes before Iter. 2 begins.
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vectorized code (figure): a single vector instruction performs the load, load, add, store for all iterations at once, with time running downward.
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions:
– Example machine has 32 elements per vector register and 8 lanes.
– Figure: the load unit, multiply unit, and add unit each work on successive vector instructions; the machine completes 24 operations/cycle while issuing 1 short instruction/cycle.
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
• Convoy: a set of vector instructions that could potentially execute together.
  – Must not contain structural hazards.
  – Sequences with RAW hazards should be in different convoys.
• However, sequences with RAW hazards can be in the same convoy via chaining.
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
LV V1
MULV V3, V1, V2
ADDV V5, V3, V4

Figure: the load unit chains from memory into V1 and the multiplier, and the multiplier chains into the adder: memory -> V1 -> (× V2) -> V3 -> (+ V4) -> V5.
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction: Load, then Mul, then Add, fully serialized in time.
• With chaining, can start the dependent instruction as soon as the first result appears: Load, Mul, and Add overlap.
Chimes
• Chime: unit of time to execute one convoy.
  – m convoys execute in m chimes.
  – For vector length n, m convoys require m × n clock cycles.
CA-Lec8 cwliutwinseenctuedutw 27
Example
LV V1, Rx          ; load vector X
MULVS.D V2, V1, F0 ; vector-scalar multiply
LV V3, Ry          ; load vector Y
ADDVV.D V4, V2, V3 ; add two vectors
SV Ry, V4          ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes; 2 FP ops per result; cycles per FLOP = 1.5.
For 64-element vectors, this requires 64 × 3 = 192 clock cycles.
CA-Lec8 cwliutwinseenctuedutw 28
Vector Strip‐mining
ANDI R1, N, 63     ; N mod 64
MTC1 VLR, R1       ; do remainder
loop:
LV V1, RA
DSLL R2, R1, 3     ; multiply by 8
DADDU RA, RA, R2   ; bump pointer
LV V2, RB
DADDU RB, RB, R2
ADDV.D V3, V1, V2
SV V3, RC
DADDU RC, RC, R2
DSUBU N, N, R1     ; subtract elements
LI R1, 64
MTC1 VLR, R1       ; reset full length
BGTZ N, loop       ; any more to do?

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

Figure: A, B, and C are processed as one remainder piece followed by full 64-element pieces.
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers ("strip-mining").
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
• Handles loops whose length is not a multiple of 64.
• The vector length is not known at compile time. Use the Vector Length Register (VLR) for vectors over the maximum length (strip mining):

low = 0;
VL = (n % MVL);                        /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop */
  for (i = low; i < (low+VL); i=i+1)   /* runs for length VL */
    Y[i] = a * X[i] + Y[i];            /* main operation */
  low = low + VL;                      /* start of next vector */
  VL = MVL;                            /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
Figure, simple implementation: execute all N operations but turn off the result writeback according to the mask (write disabled where M[i] = 0; e.g., with M = {0,1,0,0,1,1,0,1}, only C[1], C[4], C[5], and C[7] are written).
Figure, density-time implementation: scan the mask vector and only execute elements with non-zero masks.
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction.
• The VMR is part of the architectural state.
• A vector processor relies on compilers to manipulate the VMR explicitly.
• A GPU gets the same effect using hardware:
  – Invisible to SW.
• Both GPUs and vector processors spend time on masking.
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor.
• Consider:
for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV V1, Rx          ; load vector X into V1
LV V2, Ry          ; load vector Y
L.D F0, #0         ; load FP zero into F0
SNEVS.D V1, F0     ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1, V1, V2 ; subtract under vector mask
SV Rx, V1          ; store the result in X
• The GFLOPS rate decreases (masked-off elements still take time).
CA-Lec8 cwliutwinseenctuedutw 34
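A scalar C sketch of what the masked sequence above computes (an editor's illustration, not slide code):

void masked_sub(double X[64], const double Y[64]) {
    int mask[64];
    for (int i = 0; i < 64; i++)   /* SNEVS.D: set mask bit where X[i] != 0 */
        mask[i] = (X[i] != 0.0);
    for (int i = 0; i < 64; i++)   /* SUBVV.D under mask: only masked lanes write */
        if (mask[i])
            X[i] = X[i] - Y[i];
}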
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register:
  – The population count of the mask vector gives the packed vector length.
• Expand performs the inverse operation.
Figure: with mask M = {0,1,0,0,1,1,0,1}, compress packs A[1], A[4], A[5], A[7] into the start of the destination register; expand writes them back to positions 1, 4, 5, and 7, leaving the other positions (B values) unchanged.
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
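A minimal C sketch of compress and expand under a mask (an editor's illustration):

int compress(const double a[], const int mask[], int n, double packed[]) {
    int j = 0;
    for (int i = 0; i < n; i++)    /* pack masked elements contiguously */
        if (mask[i]) packed[j++] = a[i];
    return j;                      /* population count of mask = packed length */
}

void expand(double a[], const int mask[], int n, const double packed[]) {
    int j = 0;
    for (int i = 0; i < n; i++)    /* inverse: scatter back to masked slots */
        if (mask[i]) a[i] = packed[j++];
}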
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector.
• Penalties for start-ups on load/store units are higher than those for arithmetic units.
• The memory system must be designed to support high bandwidth for vector loads and stores:
  – Spread accesses across multiple banks:
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently.
2. Support (multiple) non-sequential loads or stores.
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses.
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle.
• Processor cycle time is 2.167 ns.
• SRAM cycle time is 15 ns.
• How many memory banks are needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192.
2. An SRAM access takes 15 / 2.167 ≈ 7 processor cycles.
3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth.
CA-Lec8 cwliutwinseenctuedutw 37
The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth).
Memory Addressing
• Load/store operations move groups of data between registers and memory.
• Three types of addressing:
  – Unit stride:
    • Contiguous block of information in memory.
    • Fastest; always possible to optimize this.
  – Non-unit (constant) stride:
    • Harder to optimize the memory system for all possible strides.
    • A prime number of data banks makes it easier to support different strides at full bandwidth.
  – Indexed (gather-scatter):
    • Vector equivalent of register indirect.
    • Good for sparse arrays of data.
    • Increases the number of programs that vectorize.
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements are in different DRAMs.
  – The startup time for a vector operation is the latency of a single read.
• What about non-unit stride?
  – The layout below is good for strides that are relatively prime to 8.
  – Bad for strides of 2 and 4.
Figure: the vector processor connects to eight unpipelined DRAM banks, selected by address mod 8 = 0, 1, ..., 7.
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride.
• In the example:
  – Matrix D has a stride of 100 double words.
  – Matrix B has a stride of 1 double word.
• Use non-unit stride for D:
  – To access non-sequential memory locations and to reshape them into a dense structure.
• The size of the matrix may not be known at compile time:
  – Use LVWS/SVWS: load/store vector with stride.
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic).
• Caches inherently deal with unit-stride data:
  – Blocking techniques help non-unit-stride data.
CA-Lec8 cwliutwinseenctuedutw 41
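A small C illustration of the strides above (an editor's sketch, assuming row-major 100×100 double matrices as in the example):

#include <stdio.h>

static double B[100][100], D[100][100];

int main(void) {
    /* Along a row of B, successive elements are adjacent (stride 1);
       down a column of D, each step skips a whole 100-element row
       (stride 100 double words = 800 bytes). */
    printf("stride along row of B: %td elements\n", &B[0][1] - &B[0][0]); /* 1 */
    printf("stride down column of D: %td elements\n", &D[1][0] - &D[0][0]); /* 100 */
    return 0;
}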
How to get full bandwidth for unit stride
• The memory system must sustain (# lanes × word) per clock.
• # memory banks > memory latency, to avoid stalls.
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously),
  – Or wider DRAMs (only good for unit stride or large data types).
• More banks are also good for supporting more strides at full bandwidth.
  – (One can read the paper on how to implement a prime number of banks efficiently.)
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, accesses down a column conflict on word accesses.
• SW fix: loop interchange, or declaring the array not a power of 2 ("array padding"); see the sketch below.
• HW fix: a prime number of banks:
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo and divide per memory access, with a prime number of banks
  – address within bank = address mod number of words in bank
  – bank number is easy if there are 2^N words per bank

int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
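A sketch of the loop-interchange fix for the code above: making j the inner index turns the bank-conflicting stride-512 walk into unit-stride access.

int x[256][512];
void scale(void) {
    for (int i = 0; i < 256; i = i+1)       /* interchange: i outer ...      */
        for (int j = 0; j < 512; j = j+1)   /* ... j inner gives unit stride */
            x[i][j] = 2 * x[i][j];
}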
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently:
  – A memory bank conflict.
• The later request is stalled to resolve a bank conflict.
• A bank conflict (stall) occurs when the same bank is hit again before the bank busy time has elapsed:
  – #banks / GCD(stride, #banks), equivalently LCM(stride, #banks) / stride, < bank busy time.
CA-Lec8 cwliutwinseenctuedutw 44
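A minimal C sketch of the stall condition above (helper names are illustrative):

static int gcd_u(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Stall if the same bank repeats within the bank busy time. */
static int bank_conflict(int stride, int banks, int busy_time) {
    return banks / gcd_u(stride, banks) < busy_time;
}

/* bank_conflict(1, 8, 6)  -> 0: 8 distinct banks between repeats, no stall
   bank_conflict(32, 8, 6) -> 1: every access hits the same bank, stalls   */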
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access.
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element.
2. A stride of 32 is a multiple of 8, the worst possible stride:
  – Every access to memory (after the first one) collides with the previous access and must wait for the 6-cycle bank busy time.
  – The total time is 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element.
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector Architecture
• Handling sparse matrices in vector mode is a necessity.
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C.
• Sparse matrix elements are stored in a compact form and accessed indirectly.
• Gather-scatter operations are used:
  – Using LVI/SVI: load/store vector indexed, i.e., gather/scatter.
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors.
• Use index vectors:
LV Vk, Rk          ; load K
LVI Va, (Ra+Vk)    ; load A[K[]]
LV Vm, Rm          ; load M
LVI Vc, (Rc+Vm)    ; load C[M[]]
ADDVV.D Va, Va, Vc ; add them
SVI (Ra+Vk), Va    ; store A[K[]]
• A and C must have the same number of non-zero elements (the sizes of K and M).
CA-Lec8 cwliutwinseenctuedutw 47
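The scalar semantics of the indexed vector loads and stores above, as a C sketch (an editor's illustration):

void sparse_sum(double A[], const double C[], const int K[], const int M[], int n) {
    for (int i = 0; i < n; i++) {
        double a = A[K[i]];   /* LVI: gather A[K[i]] */
        double c = C[M[i]];   /* LVI: gather C[M[i]] */
        A[K[i]] = a + c;      /* ADDVV.D, then SVI: scatter the sum */
    }
}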
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
• More lanes allow a slower clock rate for the same performance:
  – Scalable if elements are independent.
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element.
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs.
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b.
• Newer designs have 128-bit registers (Altivec, SSE2).
• Limited instruction set:
  – No vector length control.
  – No load/store stride or scatter/gather.
  – Unit-stride loads must be aligned to a 64/128-bit boundary.
• Limited vector register length:
  – Requires superscalar dispatch to keep multiply/add/load units busy.
  – Loop unrolling to hide latencies increases register pressure.
• Trend towards fuller vector support in microprocessors.
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the first additions since the 386):
  – Similar to Intel 860, Motorola 88110, HP PA-71000LC, UltraSPARC.
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements in 64 bits:
  – Reuses the 8 FP registers (FP and MMX cannot mix).
• Short vector: load, add, store of 8 8-bit operands.
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – Used in drivers or added to library routines; no compiler support.
MMX Instructions
• Move: 32b, 64b.
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b:
  – Optional signed/unsigned saturate (set to max) on overflow.
• Shifts (sll, srl, sra), And, AndNot, Or, Xor in parallel: 8 8b, 4 16b, 2 32b.
• Multiply, Multiply-Add in parallel: 4 16b.
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b:
  – Sets the field to 0s (false) or 1s (true); removes branches.
• Pack/Unpack:
  – Convert 32b <-> 16b, 16b <-> 8b.
  – Pack saturates (set to max) if the number is too large.
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
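As an illustration of these extensions, a minimal SSE2 sketch (an editor's example, not from the slides) that adds eight 16-bit integers in one instruction:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* c[0..7] = a[0..7] + b[0..7], eight 16-bit adds at once (PADDW). */
void add8x16(const short *a, const short *b, short *c) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi16(va, vb));
}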
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit.
• Easy to implement.
• Needs smaller memory bandwidth than a full vector architecture.
• Separate data transfers, aligned in memory:
  – Vector: a single instruction can make 64 memory accesses, so a page fault in the middle of the vector is likely.
• Uses much smaller register space.
• Fewer operands.
• No need for the sophisticated mechanisms of vector architectures.
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DXPY
LD F0, a            ; load scalar a
MOV F1, F0          ; copy a into F1 for SIMD MUL
MOV F2, F0          ; copy a into F2 for SIMD MUL
MOV F3, F0          ; copy a into F3 for SIMD MUL
DADDIU R4, Rx, #512 ; last address to load
Loop:
L.4D F4, 0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4, F4, F0   ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D F8, 0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8, F8, F4   ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0[Ry], F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx, Rx, #32  ; increment index to X
DADDIU Ry, Ry, #32  ; increment index to Y
DSUBU R20, R4, Rx   ; compute bound
BNEZ R20, Loop      ; check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
• Scalar processor memory architecture:
  – Only access to contiguous data:
    • No efficient scatter/gather accesses.
  – Significant penalty for unaligned memory access.
  – May need to write an entire vector register.
• Limitations on data access patterns:
  – Limited by cache line, up to 128-256b.
• Conditional execution:
  – Register renaming does not work well with masked execution.
  – Always need to write the whole register.
  – Difficult to know when to indicate exceptions.
• Register pressure:
  – Need to use multiple registers, rather than register depth, to hide latency.
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity.
  – Ties together floating-point performance and memory performance for a target machine.
• Arithmetic intensity:
  – Floating-point operations per byte read.
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)
CA-Lec8 cwliutwinseenctuedutw 59
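A tiny C sketch of the roofline bound (an editor's illustration; bw in GB/s, ai in FLOPs/byte, peak in GFLOPs/s, names are assumptions):

double attainable_gflops(double bw, double ai, double peak) {
    double memory_bound = bw * ai;           /* bandwidth-limited throughput */
    return memory_bound < peak ? memory_bound : peak;  /* capped by peak FP */
}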
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model:
    • The CPU is the host, the GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – The programming model is "Single Instruction, Multiple Thread" (SIMT).
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++:
    • Focus on the important issues of parallelism, i.e., how to craft efficient parallel algorithms, rather than grappling with the mechanics of an unfamiliar and complicated language.
  – Provide a minimalist set of abstractions for expressing parallelism:
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems.
  – Scatter-gather transfers from memory into local store.
  – Mask registers.
  – Large register files.
• Differences:
  – No scalar processor; scalar integration.
  – Uses multithreading to hide memory latency.
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor.
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA programming model:
  – Single Instruction, Multiple Thread (SIMT):
    • A thread is associated with each data element.
    • Threads are organized into blocks.
    • Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS:
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192:
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes:
    • 512 threads per block.
  – A SIMD instruction executes 32 elements at a time.
  – Thus the grid size is 16 blocks.
  – A block is analogous to a strip-mined vector loop with a vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
• Threads of SIMD instructions:
  – Each has its own PC.
  – The thread scheduler uses a scoreboard to dispatch.
  – No data dependences between threads.
  – Keeps track of up to 48 threads of SIMD instructions:
    • Hides memory latency.
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes.
  – Wide and shallow compared to vector processors.
CA-Lec8 cwliutwinseenctuedutw 66
Example
• An NVIDIA GPU has 32,768 registers:
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements, or
    • 32 vector registers of 32 64-bit elements.
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers.
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
Figure: 1280 MB global memory and a 640 KB L2 cache shared by 15 SMs (SM 0 ... SM 14); each SM has 32 cores, 32,768 registers, 48 KB shared memory, a 16 KB L1 cache, an 8 KB texture cache, and an 8 KB constant cache, with up to 1536 threads per SM.
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively:
  – Memory access optimization, latency hiding.
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
  – Matrix C will require the calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
CA-Lec8 cwliutwinseenctuedutw 72
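A minimal CUDA sketch of the one-thread-per-cell scheme described above (an editor's illustration, without the shared-memory buffering the slide mentions):

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // one thread per C cell
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];   // row of A times column of B
        C[row * N + col] = sum;
    }
}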
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device. Variables declared there are allocated to GPU memory.
  – __host__ => system processor (host).
• Function call: Name<<<dimGrid, dimBlock>>>(... parameter list ...)
  – blockIdx: block identifier.
  – threadIdx: thread-within-block identifier.
  – blockDim: threads per block.
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<<...>>> is __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Architecture
• "Parallel Thread Execution (PTX)".
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example:
shl.s32 R8, blockIdx, 9    ; Thread Block ID × Block size (512, or 2^9)
add.s32 R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4      ; Product in RD0 = RD0 × RD4 (scalar a)
add.f64 RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0  ; Y[i] = sum (X[i]×a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – A branch synchronization stack:
    • Entries consist of masks for each SIMD lane,
    • i.e., which threads commit their results (all threads execute).
  – Instruction markers to manage when a branch diverges into multiple execution paths:
    • Push on a divergent branch,
  – ...and when paths converge:
    • They act as barriers,
    • and pop the stack.
• A per-thread-lane 1-bit predicate register, specified by the programmer.
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0          ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2             ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM:
  – "Private memory".
  – Contains the stack frame, spilled registers, and private variables.
• Each multithreaded SIMD processor also has local memory:
  – Shared by the SIMD lanes (threads) within a block.
• Memory shared by the SIMD processors is GPU memory:
  – The host can read and write GPU memory.
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers and two instruction dispatch units.
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units.
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles.
• Fast double precision.
• Caches for GPU memory.
• 64-bit addressing and a unified address space.
• Error correcting codes.
• Faster context switching.
• Faster atomic instructions.
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence:
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations.
• Loop-level parallelism has no loop-carried dependence.
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence.
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 in the previous iteration:
  – A loop-carried dependence.
• S2 uses the value computed by S1 in the same iteration:
  – Not a loop-carried dependence.
CA-Lec8 cwliutwinseenctuedutw 82
Example VMIPS
CA-Lec8 cwliutwinseenctuedutw 9
bull Loosely based on Cray‐1bull Vector registers
ndash Each register holds a 64‐element 64 bitselement vector
ndash Register file has 16 read ports and 8 write ports
bull Vector functional unitsndash Fully pipelinedndash Data and control hazards are detected
bull Vector load‐store unitndash Fully pipelinedndash Words move between registersndash One word per clock cycle after initial latency
bull Scalar registersndash 32 general‐purpose registersndash 32 floating‐point registers
VMIPS Instructionsbull Operate on many elements concurrently
bull Allows use of slow but wide execution unitsbull High performance lower power
bull Independence of elements within a vector instructionbull Allows scaling of functional units without costly dependence checks
bull Flexiblebull 64 64‐bit 128 32‐bit 256 16‐bit 512 8‐bitbull Matches the need of multimedia (8bit) scientific applications that require high precision
CA-Lec8 cwliutwinseenctuedutw 10
Vector Instructionsbull ADDVVD add two vectorsbull ADDVSD add vector to a scalarbull LVSV vector load and vector store from addressbull Vector code example
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
11
Vector Memory‐Memory vsVector Register Machines
bull Vector memory‐memory instructions hold all vector operands in main memory
bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines
bull Cray‐1 (rsquo76) was first vector register machine
for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]
Example Source Code ADDV C A BSUBV D A B
Vector Memory-Memory Code
LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D
Vector Register Code
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory
bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses
bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100
elementsndash For Cray‐1 vectorscalar breakeven point was around 2
elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantagesbull Compact
ndash One short instruction encodes N operationsbull Expressive
ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous
instructionndash N operations access a contiguous block of memory (unit‐stride
loadstore)ndash N operations access memory in a known pattern (stridden loadstore)
bull Scalablendash Can run same object code on more parallel pipelines or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Examplebull Example DAXPY
LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result
bull In MIPS Codebull ADD waits for MUL SD waits for ADD
bull In VMIPSbull Stall once for the first vector element subsequent elements will flow
smoothly down the pipeline bull Pipeline stall required once per vector instruction
CA-Lec8 cwliutwinseenctuedutw 15
adds a scalar multiple of a double precision vector to another double precision vector
Requires 6 instructions only
Challenges of Vector Instructionsbull Start up time
ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP
ndash Latency of vector functional unitndash Assume the same as Cray‐1
bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
16
V1 V2 V3
V3 lt- v1 v2
Six stage multiply pipeline
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the word accesses of the inner loop all hit the same bank
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– but: modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number? easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
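A sketch of the two software fixes named above, applied to this loop (the padded column count 513 is one common choice; any non-power-of-2 works):

    /* Fix 1: loop interchange - the inner loop now walks j, which is unit
       stride in row-major C, so successive accesses hit successive banks */
    for (i = 0; i < 256; i = i+1)
        for (j = 0; j < 512; j = j+1)
            x[i][j] = 2 * x[i][j];

    /* Fix 2: array padding - 513 words per row is not a multiple of 128,
       so successive elements of a column land in different banks */
    int x[256][513];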
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict!
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride:
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
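A small calculator that reproduces both numbers under this example's simplified timing model (one element per cycle when accesses rotate through the banks; a full bank-busy wait per element when every access hits one bank). The predicate only distinguishes the two cases above; it is not a general conflict test:

    int vector_load_cycles(int n, int latency, int busy, int banks, int stride) {
        if (stride % banks != 0)             /* accesses rotate through the banks   */
            return latency + n;              /* one element per cycle after startup */
        else                                 /* all accesses hit the same bank      */
            return latency + 1 + busy * (n - 1);
    }
    /* vector_load_cycles(64, 12, 6, 8, 1)  == 76  -> 1.2 cycles/element */
    /* vector_load_cycles(64, 12, 6, 8, 32) == 391 -> 6.1 cycles/element */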
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
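Spelled out in scalar C, the index-vector sequence above does the following (Va and Vc stand in for the vector registers; a sketch, not VMIPS semantics in every detail):

    for (i = 0; i < n; i++) Va[i] = A[K[i]];   /* LVI: gather A[K[]]  */
    for (i = 0; i < n; i++) Vc[i] = C[M[i]];   /* LVI: gather C[M[]]  */
    for (i = 0; i < n; i++) Va[i] += Vc[i];    /* ADDVV.D: dense add  */
    for (i = 0; i < n; i++) A[K[i]] = Va[i];   /* SVI: scatter back   */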
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is dependency:
• One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to compiler
• Cray Y-MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– use in drivers or added to library routines; no compiler
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if number is too large
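Saturation is the one semantic difference from ordinary integer adds. A one-lane C sketch of what an unsigned saturating byte add (MMX's PADDUSB, applied across 8 lanes at once) computes:

    unsigned char add_sat_u8(unsigned char a, unsigned char b) {
        unsigned int s = (unsigned int)a + (unsigned int)b;
        return (s > 255) ? 255 : (unsigned char)s;  /* clamp to max instead of wrapping */
    }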
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed 64-bit floating point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
– Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DXPY
      L.D     F0,a        ; load scalar a
      MOV     F1, F0      ; copy a into F1 for SIMD MUL
      MOV     F2, F0      ; copy a into F2 for SIMD MUL
      MOV     F3, F0      ; copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512  ; last address to load
Loop: L.4D    F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D    F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D    0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32   ; increment index to X
      DADDIU  Ry,Ry,#32   ; increment index to Y
      DSUBU   R20,R4,Rx   ; compute bound
      BNEZ    R20,Loop    ; check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers rather than register depth to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
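Computationally the roofline is just the minimum of the two bounds; a two-line C sketch (parameter units assumed: GFLOPs/sec, GB/sec, FLOPs/byte):

    double attainable_gflops(double peak_fp, double peak_bw, double intensity) {
        double memory_bound = peak_bw * intensity;                  /* slanted part of the roof */
        return (memory_bound < peak_fp) ? memory_bound : peak_fp;  /* flat part of the roof    */
    }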
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++,
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
– Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip-mined vector loop with vector length of 32
– Block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
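The grid-size arithmetic from the bullets above, written out (numbers from the example):

    int n = 8192;                              /* elements                       */
    int block = 512;                           /* threads per thread block       */
    int grid  = (n + block - 1) / block;       /* = 16 blocks                    */
    int simd_width = 32;                       /* elements per SIMD instruction  */
    int insns_per_block = block / simd_width;  /* = 16 SIMD instructions / block */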
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
(Figure: GTX570 memory hierarchy: 1280MB global memory and 640KB L2 cache shared by 15 SMs (SM 0 ... SM 14); each SM has 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache, and runs up to 1536 threads)
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
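A hedged CUDA sketch of that scheme: one thread per cell of C, with TILE×TILE sub-matrices staged through shared memory. The kernel name, the TILE size, and the assumption that n is a multiple of TILE are illustrative, not from the lecture:

    #define TILE 16
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;   /* this thread's cell of C */
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; t++) {         /* walk tiles along the dot product */
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();                         /* whole tile loaded before use */
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                         /* done with this tile */
        }
        C[row * n + col] = acc;                      /* one element of C per thread */
    }
    /* launch: matmul_tiled<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n); */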
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU device; variables declared inside are allocated to the GPU memory
– __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push       ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2  ; Difference in RD0
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
        @P1, bra ENDIF1, *Comp       ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism means there is no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];   /* S1 */
    B[i+1] = B[i] + A[i+1]; /* S2 */
}
• S1 and S2 each use a value (A[i], B[i]) computed by themselves in the previous iteration
– Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];   /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];     /* S2 */
    A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize (see the sketch below)
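The effect of that optimization, sketched in C (t is a compiler temporary; element types assumed double):

    for (i = 0; i < 100; i = i+1) {
        double t = B[i] + C[i];   /* computed once, kept in a register */
        A[i] = t;
        D[i] = t * E[i];          /* no second load of A[i] */
    }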
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies: Example
• Assume:
– Load an array element with index c × i + d, and store to one with index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c)=2, |b-d|=3
3. Since 2 does not divide 3, no dependence is possible
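A small C sketch of the test itself (conservative by design: it ignores the loop bounds, so "may depend" can be a false positive, but "no dependence" is definitive; assumes a or c is nonzero):

    int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    int gcd_test_may_depend(int a, int b, int c, int d) {
        int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
        int diff = d - b;
        if (diff < 0) diff = -diff;
        return diff % g == 0;   /* nonzero remainder => no dependence possible */
    }
    /* gcd_test_may_depend(2, 3, 2, 0): g = 2, |d-b| = 3, 3 % 2 != 0 -> returns 0 */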
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c; /* S1 */
    X[i] = X[i] + c; /* S2 */
    Z[i] = Y[i] + c; /* S3 */
    Y[i] = c - Y[i]; /* S4 */
}
• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to...
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion (scalar ===> vector); it is parallel.
The second loop is called a reduction: it sums up the elements of the vector; not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
VMIPS Instructions
• Operate on many elements concurrently
– Allows use of slow but wide execution units
– High performance, lower power
• Independence of elements within a vector instruction
– Allows scaling of functional units without costly dependence checks
• Flexible
– 64 64-bit / 128 32-bit / 256 16-bit / 512 8-bit elements
– Matches the need of multimedia (8-bit) and scientific applications that require high precision
CA-Lec8 cwliutwinseenctuedutw 10
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add vector to a scalar
• LV/SV: vector load and vector store from address
• Vector code example:
CA-Lec8 cwliutwinseenctuedutw 11
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory instructions hold all vector operands in main memory
• The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
• Cray-1 ('76) was the first vector register machine

Example source code:
for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
    D[i] = A[i] - B[i];
}

Vector memory-memory code:
ADDV C, A, B
SUBV D, A, B

Vector register code:
LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory-Memory vs. Vector Register Machines
• Vector memory-memory architectures (VMMAs) require greater main memory bandwidth. Why?
– All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector operations. Why?
– Must check dependencies on memory addresses
• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star-100 for vectors < 100 elements
– For Cray-1, the vector/scalar breakeven point was around 2 elements
• Apart from CDC follow-ons (Cyber-205, ETA-10), all major vector machines since Cray-1 have had vector register architectures
(we ignore vector memory-memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantages
• Compact
– One short instruction encodes N operations
• Expressive; tells hardware that these N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in the same pattern as the previous instruction
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
• Scalable
– Can run the same object code on more parallel pipelines, or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Example
• Example: DAXPY
L.D     F0,a       ; load scalar a
LV      V1,Rx      ; load vector X
MULVS.D V2,V1,F0   ; vector-scalar mult
LV      V3,Ry      ; load vector Y
ADDVV   V4,V2,V3   ; add
SV      Ry,V4      ; store result
• In MIPS code:
– ADD waits for MUL, SD waits for ADD
• In VMIPS:
– Stall once for the first vector element; subsequent elements flow smoothly down the pipeline
– A pipeline stall is required once per vector instruction!
CA-Lec8 cwliutwinseenctuedutw 15
DAXPY adds a scalar multiple of a double precision vector to another double precision vector.
Requires 6 instructions only!
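For contrast, a sketch of the scalar MIPS loop for the same DAXPY (in the textbook's style; register choices are illustrative). With 64 elements it executes on the order of 600 dynamic instructions versus the 6 above:

      L.D     F0,a        ; load scalar a
      DADDIU  R4,Rx,#512  ; last address to load
Loop: L.D     F2,0(Rx)    ; load X[i]
      MUL.D   F2,F2,F0    ; a * X[i]
      L.D     F4,0(Ry)    ; load Y[i]
      ADD.D   F4,F4,F2    ; a * X[i] + Y[i]
      S.D     F4,0(Ry)    ; store into Y[i]
      DADDIU  Rx,Rx,#8    ; increment index to X
      DADDIU  Ry,Ry,#8    ; increment index to Y
      DSUBU   R20,R4,Rx   ; compute bound
      BNEZ    R20,Loop    ; check if done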
Challenges of Vector Instructions
• Start-up time
– Application and architecture must support long vectors; otherwise, they will run out of instructions requiring ILP
– Latency of the vector functional unit
– Assume the same as Cray-1:
• Floating-point add => 6 clock cycles
• Floating-point multiply => 7 clock cycles
• Floating-point divide => 20 clock cycles
• Vector load => 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw 16
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
(Figure: a six-stage multiply pipeline computing V3 <- V1 * V2)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV C, A, B
(Figure: execution using one pipelined functional unit completes one result per cycle: C[0], C[1], C[2], ...; four-lane execution using four pipelined functional units completes C[0]-C[3], C[4]-C[7], ... in each cycle)

Vector Unit Structure
(Figure: four lanes, each holding one pipe of each functional unit, a slice of the vector registers, and a port to the memory subsystem; lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, ...)
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
– Allows for multiple hardware lanes
– No communication between lanes
– Little increase in control overhead
– No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
(Figure: scalar sequential code runs load, load, add, store for iteration 1, then again for iteration 2, and so on down the time axis; vectorized code issues one vector load, vector load, vector add, vector store covering all iterations)
Vectorization is a massive compile-time reordering of operation sequencing; requires extensive loop dependence analysis.
Vector Execution Time
• Execution time depends on three factors:
– Length of operand vectors
– Structural hazards
– Data dependencies
• VMIPS functional units consume one element per clock cycle
– Execution time is approximately the vector length
• Convoy
– Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw 22
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
(Figure: load, mul, and add vector instructions overlapped across the load unit, multiply unit, and add unit over time)
• Complete 24 operations/cycle while issuing 1 short instruction/cycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
• Convoy: set of vector instructions that could potentially execute together
– Must not contain structural hazards
– Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available
LV   V1
MULV V3, V1, V2
ADDV V5, V3, V4
(Figure: memory feeds V1 through the load unit; V1 chains into the multiply unit with V2 to produce V3; V3 chains into the add unit with V4 to produce V5)
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
• Without chaining, must wait for the last element of the result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
(Figure: Load/Mul/Add time-lines with and without chaining)
Chimes
• Chime: unit of time to execute one convoy
– m convoys execute in m chimes
– For a vector length of n, requires m × n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27

Example
LV      V1,Rx    ; load vector X
MULVS.D V2,V1,F0 ; vector-scalar multiply
LV      V3,Ry    ; load vector Y
ADDVV.D V4,V2,V3 ; add two vectors
SV      Ry,V4    ; store the sum
Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV
3 chimes; 2 FP ops per result; cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw 28
Vector Strip-mining
Problem: vector registers have finite length (MVL = 64)
Solution: break loops into pieces that fit into vector registers: "strip-mining"
for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63     ; N mod 64
      MTC1   VLR, R1       ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     ; Multiply by 8
      DADDU  RA, RA, R2    ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       ; Reset full length
      BGTZ   N, loop       ; Any more to do?
(Figure: A + B = C computed as full 64-element chunks plus one remainder chunk)
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
• Handles loops whose length is not a multiple of 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length; strip mining:
low = 0;
VL = (n % MVL);                      /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {   /* outer loop */
    for (i = low; i < (low+VL); i=i+1)   /* runs for length VL */
        Y[i] = a * X[i] + Y[i];      /* main operation */
    low = low + VL;                  /* start of next vector */
    VL = MVL;                        /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
(Figure: two implementations of masked vector execution for mask M[0..7] = 0,1,0,0,1,1,0,1)
Simple implementation:
– execute all N operations; turn off result writeback according to mask (per-element write disable: only C[1], C[4], C[5], and C[7] are written)
Density-time implementation:
– scan the mask vector and only execute elements with non-zero masks
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
– Invisible to SW
• Both GPUs and vector processors spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in vector mode without special support
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV      V1,Rx     ; load vector X into V1
LV      V2,Ry     ; load vector Y
L.D     F0,#0     ; load FP zero into F0
SNEVS.D V1,F0     ; sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2  ; subtract under vector mask
SV      Rx,V1     ; store the result in X
• GFLOPS rate decreases!
CA-Lec8 cwliutwinseenctuedutw 34
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
– population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
(Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand spreads the packed values back to positions 1, 4, 5, 7, leaving the other elements unchanged)
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
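Scalar sketches of the two operations (M is the 0/1 mask, n the vector length; the function names are illustrative):

    int compress(const double *A, const int *M, double *R, int n) {
        int len = 0;                     /* ends up equal to popcount(M) */
        for (int i = 0; i < n; i++)
            if (M[i]) R[len++] = A[i];   /* pack enabled elements at the front */
        return len;
    }
    void expand(const double *R, const int *M, double *B, int n) {
        int j = 0;
        for (int i = 0; i < n; i++)
            if (M[i]) B[i] = R[j++];     /* inverse: unmasked B[i] stay unchanged */
    }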
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores:
– Spread accesses across multiple banks
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize (a sketch follows)
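A minimal sketch in plain C of what the optimized code amounts to (the temporary t is illustrative, not from the lecture):

  for (i=0; i<100; i=i+1) {
      double t = B[i] + C[i];  /* the value of A[i] is kept in a register   */
      A[i] = t;                /* the store still happens...                */
      D[i] = t * E[i];         /* ...but the second reference needs no load */
  }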
Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism (the analysis is "inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a * i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
  – An array element is loaded with index c * i + d and stored with index a * i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m <= j <= n and m <= k <= n
  2. a * j + b = c * k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide (d - b), i.e., GCD(c, a) | |d - b|
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2, |d - b| = 3
  3. Since 2 does not divide 3, no dependence is possible
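For concreteness, a minimal sketch of the test in plain C (function names are illustrative, not from the lecture; it assumes a and c are not both zero):

  /* Euclid's algorithm */
  static int gcd(int x, int y) {
      while (y != 0) { int t = x % y; x = y; y = t; }
      return x < 0 ? -x : x;
  }

  /* Store to X[a*i+b], load from X[c*i+d]: a loop-carried
     dependence is possible only if GCD(c, a) divides (d - b). */
  static int may_depend(int a, int b, int c, int d) {
      return (d - b) % gcd(c, a) == 0;
  }

For the loop above, may_depend(2, 3, 2, 0) returns 0, matching the hand analysis.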
Remarks
• The GCD test is sufficient but not necessary: if it fails, no dependence is possible, but its success does not guarantee a dependence
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists (see the example below)
• In general, determining whether a dependence actually exists is NP-complete
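An illustrative case (not from the lecture): GCD(1, 1) = 1 divides |d - b| = 10, so the test reports a possible dependence, yet every store index lies in [10, 19] while every load index lies in [0, 9], so the two references can never touch the same element and no dependence exists.

  for (i=0; i<10; i=i+1)
      X[i+10] = X[i] + 1;   /* a=1, b=10, c=1, d=0: test succeeds, bounds say no */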
Finding Dependencies
• Example 2:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;   /* S1 */
      X[i] = X[i] + c;   /* S2 */
      Z[i] = Y[i] + c;   /* S3 */
      Y[i] = c - Y[i];   /* S4 */
  }
• True dependences: S1->S3 and S1->S4 (both on Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;   /* Y renamed to T to remove the output dependence */
      X1[i] = X[i] + c;  /* X renamed to X1 to remove the antidependence   */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }
Eliminating Recurrence Dependence
• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
The first loop is called scalar expansion (scalar ===> vector); it is fully parallel.
The second loop is called a reduction: it sums up the elements of the vector and is not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9, so each processor reduces its own slice; the ten partial sums must still be combined at the end (see the sketch below)
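Putting the pieces together, a sketch in plain C of the whole transformation (declarations omitted; the final serial combine is implied by the slide rather than shown on it):

  /* Step 1: scalar expansion -- fully parallel */
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];

  /* Step 2: each processor p = 0..9 reduces its own 1000-element slice in parallel */
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];

  /* Step 3: one processor combines the ten partial sums serially */
  for (p=1; p<10; p=p+1)
      finalsum[0] = finalsum[0] + finalsum[p];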
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting data-level parallelism
  – If code is vectorizable, then vectors give simpler hardware, better energy efficiency, and a better real-time model than OOO machines
  – Design issues include the number of lanes, the number of FUs, the number and length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Vector Memory‐Memory vsVector Register Machines
bull Vector memory‐memory instructions hold all vector operands in main memory
bull The first vector machines CDC Star‐100 (lsquo73) and TI ASC (lsquo71) were memory‐memory machines
bull Cray‐1 (rsquo76) was first vector register machine
for (i=0 iltN i++)C[i] = A[i] + B[i]D[i] = A[i] - B[i]
Example Source Code ADDV C A BSUBV D A B
Vector Memory-Memory Code
LV V1 ALV V2 BADDV V3 V1 V2SV V3 CSUBV V4 V1 V2SV V4 D
Vector Register Code
CA-Lec8 cwliutwinseenctuedutw 12
Vector Memory‐Memory vs Vector Register Machines
bull Vector memory‐memory architectures (VMMA) require greater main memory bandwidth whyndash All operands must be read in and out of memory
bull VMMAs make if difficult to overlap execution of multiple vector operations why ndash Must check dependencies on memory addresses
bull VMMAs incur greater startup latencyndash Scalar code was faster on CDC Star‐100 for vectors lt 100
elementsndash For Cray‐1 vectorscalar breakeven point was around 2
elements Apart from CDC follow‐ons (Cyber‐205 ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
(we ignore vector memory‐memory from now on)
CA-Lec8 cwliutwinseenctuedutw 13
Vector Instruction Set Advantagesbull Compact
ndash One short instruction encodes N operationsbull Expressive
ndash tells hardware that these N operations are independentndash N operations use the same functional unitndash N operations access disjoint registersndash N operations access registers in the same pattern as previous
instructionndash N operations access a contiguous block of memory (unit‐stride
loadstore)ndash N operations access memory in a known pattern (stridden loadstore)
bull Scalablendash Can run same object code on more parallel pipelines or lanes
CA-Lec8 cwliutwinseenctuedutw 14
Vector Instructions Examplebull Example DAXPY
LD F0a load scalar aLV V1Rx load vector XMULVSD V2V1F0 vector-scalar multLV V3Ry load vector YADDVV V4V2V3 addSV RyV4 store result
bull In MIPS Codebull ADD waits for MUL SD waits for ADD
bull In VMIPSbull Stall once for the first vector element subsequent elements will flow
smoothly down the pipeline bull Pipeline stall required once per vector instruction
CA-Lec8 cwliutwinseenctuedutw 15
adds a scalar multiple of a double precision vector to another double precision vector
Requires 6 instructions only
Challenges of Vector Instructionsbull Start up time
ndash Application and architecture must support long vectors Otherwise they will run out of instructions requiring ILP
ndash Latency of vector functional unitndash Assume the same as Cray‐1
bull Floating‐point add =gt 6 clock cyclesbull Floating‐point multiply =gt 7 clock cyclesbull Floating‐point divide =gt 20 clock cyclesbull Vector load =gt 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
16
V1 V2 V3
V3 lt- v1 v2
Six stage multiply pipeline
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, there is a conflict on word accesses:
int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number? easy if 2^N words per bank
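The mod/div mapping in the HW bullets is a bit operation when the bank count is a power of two, and needs real arithmetic when it is prime; a sketch in C:

/* generic mapping (any number of banks) */
unsigned bank_number(unsigned addr, unsigned nbanks)      { return addr % nbanks; }
unsigned addr_within_bank(unsigned addr, unsigned nbanks) { return addr / nbanks; }

/* trick for a prime number of banks: replace the divide by
   addr mod words_per_bank, which is a bit mask when words_per_bank = 2^N;
   the (bank, offset) pair stays unique because the bank count and 2^N are
   coprime (Chinese remainder theorem) */
unsigned addr_within_prime_bank(unsigned addr, unsigned words_per_bank) {
    return addr % words_per_bank;  /* addr & (words_per_bank - 1) for 2^N */
}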
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time. With stride s over b banks, the same bank is revisited every b / GCD(s, b) accesses, so a stall occurs when:
– #banks / GCD(stride, #banks) < bank busy time
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, this is the worst possible stride
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– the total load will take 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
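Both cases can be reproduced with a small cycle model in C (a sketch; the constants are specific to this example):

int gcd_u(int x, int y) { while (y) { int t = x % y; x = y; y = t; } return x; }

/* 8 banks, 6-cycle bank busy time, 12-cycle startup, 64 elements */
int vector_load_cycles(int stride) {
    const int banks = 8, busy = 6, latency = 12, n = 64;
    int distinct = banks / gcd_u(stride, banks); /* banks actually visited   */
    int per = (busy + distinct - 1) / distinct;  /* cycles between elements  */
    if (per < 1) per = 1;
    return latency + 1 + (n - 1) * per;  /* stride 1 -> 76, stride 32 -> 391 */
}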
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed (gathered)
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
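In scalar C, the gather/scatter sequence above corresponds to the following sketch (the 64-element buffers stand in for vector registers; assumes n <= 64):

void sparse_add(double *A, const double *C, const int *K, const int *M, int n) {
    double va[64], vc[64];                       /* "vector registers"       */
    for (int i = 0; i < n; i++) va[i] = A[K[i]]; /* LVI Va,(Ra+Vk): gather   */
    for (int i = 0; i < n; i++) vc[i] = C[M[i]]; /* LVI Vc,(Rc+Vm): gather   */
    for (int i = 0; i < n; i++) va[i] += vc[i];  /* ADDVV.D Va,Va,Vc         */
    for (int i = 0; i < n; i++) A[K[i]] = va[i]; /* SVI (Ra+Vk),Va: scatter  */
}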
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend is towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st extension since 386)
– similar to Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
– reuse the 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
– used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (sets to max) if the number is too large
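A short illustration using the x86 MMX compiler intrinsics from <mmintrin.h> (the intrinsics interface is not covered by the slide and is shown here only as a sketch):

#include <mmintrin.h>

/* add 8 unsigned bytes at once with saturation: lanes clamp at 255 instead
   of wrapping, matching the "saturate on overflow" option listed above */
__m64 add8_saturate(__m64 a, __m64 b) {
    return _mm_adds_pu8(a, b);  /* PADDUSB */
}
/* note: MMX aliases the x87 FP registers, so _mm_empty() (EMMS) must be
   issued before any subsequent floating-point code */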
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers, aligned in memory
– Vector: a single instruction means 64 memory accesses, and a page fault in the middle of the vector is likely!
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DXPY
L.D    F0, a         ; load scalar a
MOV    F1, F0        ; copy a into F1 for SIMD MUL
MOV    F2, F0        ; copy a into F2 for SIMD MUL
MOV    F3, F0        ; copy a into F3 for SIMD MUL
DADDIU R4, Rx, #512  ; last address to load
Loop:
L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx, Rx, #32   ; increment index to X
DADDIU Ry, Ry, #32   ; increment index to Y
DSUBU  R20, R4, Rx   ; compute bound
BNEZ   R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
  • No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
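The bound the model plots is the standard roofline formula:

Attainable GFLOPs/sec = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)

To the left of the "ridge point" the machine is memory-bound; to the right it is compute-bound.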
Examples
• Attainable GFLOPs/sec
[Figure: roofline plots of attainable GFLOPs/sec vs. arithmetic intensity, lost in extraction]
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
  • CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++,
  • to focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
  • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
  • A thread is associated with each data element
  • Threads are organized into blocks
  • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
  • 512 threads per block
– A SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks (see the sketch below)
– A block is analogous to a strip-mined vector loop with vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
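A quick host-side sketch of the bookkeeping this slide describes (variable names are illustrative):

int n = 8192;                 /* vector length                         */
int threadsPerBlock = 512;    /* threads per thread block              */
int nblocks = (n + threadsPerBlock - 1) / threadsPerBlock; /* = 16, the grid */
/* each SIMD instruction covers 32 elements, so a block holds
   512 / 32 = 16 threads of SIMD instructions */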
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
  • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
Example
• The NVIDIA GPU has 32,768 registers
– Divided among the lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
  • 64 vector registers of 32 32-bit elements
  • 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
• Global Memory: 1280 MB; L2 Cache: 640 KB
• 15 SMs (SM 0 … SM 14), each with: 32 cores, 32,768 registers, 48 KB Shared Memory, 16 KB L1 Cache, 8 KB Texture Cache, 8 KB Constant Cache
• Up to 1536 threads/SM
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
Matrix Multiplication
[Figure: GPU matrix-multiplication decomposition, lost in extraction]
Programming the GPU
[Figure: CUDA matrix-multiplication code walkthrough, lost in extraction]
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a kernel sketch follows).
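A minimal CUDA kernel in the spirit of this slide (a sketch, not the lecture's exact code; names A, B, C, N are assumptions): each thread computes one element of C.

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  /* this thread's cell */
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)       /* dot product of row and column */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;           /* one thread, one element of C  */
    }
    /* the buffered variant the slide describes would first stage TILE x TILE
       sub-matrices of A and B in __shared__ arrays and __syncthreads()
       before accumulating, to exploit GPU bandwidth */
}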
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU device; variables declared are allocated to GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
  • Entries consist of masks for each SIMD lane
  • i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
  • Push on divergent branch
– ...and when paths converge
  • Act as barriers
  • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0        ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2      ; Difference in RD0
st.global.f64 [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
– The host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor, lost in extraction]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed in the previous iteration (A[i] by S1, B[i] by S2)
– Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];    /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize
Recurrence
• A recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goal: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
– We load an array element with index c × i + d and store to an element with index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
  • If a loop-carried dependence exists, then GCD(c, a) divides |d − b|
Example
for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2, |b − d| = 3
3. Since 2 does not divide 3, no dependence is possible
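A small C sketch of the GCD test as stated above (function names are illustrative):

int gcd(int x, int y) {  /* Euclid's algorithm */
    while (y != 0) { int t = x % y; x = y; y = t; }
    return x;
}

/* store index a*i+b, load index c*i+d: returns 0 when the test proves that
   no loop-carried dependence is possible, 1 when one MAY exist */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % g == 0;
}
/* the slide's example: gcd_test(2, 3, 2, 0) -> g=2, |d-b|=3, 3 % 2 != 0 -> 0 */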
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;  /* S1 */
  X[i] = X[i] + c;  /* S2 */
  Z[i] = Y[i] + c;  /* S3 */
  Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product:
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];   /* scalar expansion: scalar ===> vector; parallel */
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];  /* a reduction: sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];
– Assume p ranges from 0 to 9
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step in performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to one with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test: if a loop-carried dependence exists, then GCD(c,a) must divide (d−b)
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |d−b|=3
  3. Since 2 does not divide 3, no dependence is possible
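To make the test concrete, here is a small C sketch of the GCD test; the function names are illustrative, not part of the lecture, and note the test can only rule a dependence out:

  #include <stdio.h>
  #include <stdlib.h>

  /* Euclid's algorithm for the greatest common divisor. */
  static int gcd(int a, int b) {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return abs(a);
  }

  /* GCD test: a dependence between X[a*i+b] (store) and X[c*i+d] (load)
     is only possible if GCD(c,a) divides (d-b). Passing the test does
     not prove a dependence exists (it ignores the loop bounds). */
  static int gcd_test_may_depend(int a, int b, int c, int d) {
      return (d - b) % gcd(c, a) == 0;
  }

  int main(void) {
      /* Slide example: X[2*i+3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
      printf("dependence possible? %d\n", gcd_test_may_depend(2, 3, 2, 0)); /* prints 0 */
      return 0;
  }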
Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but it is not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies
• Example 2:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
  }
• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i]  = X[i] / c;   /* Y renamed to T to remove the output dependence */
      X1[i] = X[i] + c;   /* X renamed to X1 to remove the antidependence */
      Z[i]  = T[i] + c;
      Y[i]  = c - T[i];
  }
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
The first loop is called scalar expansion: scalar ===> vector; parallel.
The second loop is called a reduction: it sums up the elements of the vector; not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
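Putting the two transformations together, here is a C sketch of scalar expansion followed by the two-level reduction; N, NUM_PROCS, and the partial-sum layout are illustrative assumptions, not from the slides:

  #define N 10000
  #define NUM_PROCS 10
  #define CHUNK (N / NUM_PROCS)   /* 1000 elements per processor */

  double x[N], y[N], sum[N];

  double dot_product(void) {
      /* Scalar expansion: each product goes to its own element (fully parallel). */
      for (int i = 0; i < N; i++)
          sum[i] = x[i] * y[i];

      /* Per-processor partial reduction: each p could run on its own processor. */
      double finalsum[NUM_PROCS] = {0.0};
      for (int p = 0; p < NUM_PROCS; p++)
          for (int i = 0; i < CHUNK; i++)
              finalsum[p] += sum[i + CHUNK * p];

      /* Sequential combine of the ten partial sums. */
      double total = 0.0;
      for (int p = 0; p < NUM_PROCS; p++)
          total += finalsum[p];
      return total;
  }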
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Vector Instruction Set Advantages
• Compact
  – One short instruction encodes N operations
• Expressive
  – Tells hardware that these N operations are independent
  – N operations use the same functional unit
  – N operations access disjoint registers
  – N operations access registers in the same pattern as the previous instruction
  – N operations access a contiguous block of memory (unit-stride load/store)
  – N operations access memory in a known pattern (strided load/store)
• Scalable
  – Can run the same object code on more parallel pipelines or lanes
Vector Instructions Example
• Example: DAXPY (adds a scalar multiple of a double-precision vector to another double-precision vector)
  LD       F0,a       ; load scalar a
  LV       V1,Rx      ; load vector X
  MULVS.D  V2,V1,F0   ; vector-scalar multiply
  LV       V3,Ry      ; load vector Y
  ADDVV.D  V4,V2,V3   ; add
  SV       Ry,V4      ; store result
• In MIPS code: ADD waits for MUL, S.D waits for ADD
• In VMIPS:
  – Stall once for the first vector element; subsequent elements flow smoothly down the pipeline
  – Pipeline stall required only once per vector instruction
  – Requires only 6 instructions
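For reference, this vector sequence implements the following scalar C loop (Y = a×X + Y); a minimal sketch:

  /* DAXPY: Y = a*X + Y in double precision (the "D" in DAXPY). */
  void daxpy(int n, double a, const double *x, double *y) {
      for (int i = 0; i < n; i++)
          y[i] = a * x[i] + y[i];
  }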
Challenges of Vector Instructions
• Start-up time
  – Application and architecture must support long vectors; otherwise, they will run out of instructions requiring ILP
  – Latency of the vector functional unit
  – Assume the same as Cray-1:
    • Floating-point add => 6 clock cycles
    • Floating-point multiply => 7 clock cycles
    • Floating-point divide => 20 clock cycles
    • Vector load => 12 clock cycles
Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)
[Figure: six-stage multiply pipeline computing V3 <- V1 * V2]
Vector Instruction Execution
  ADDV C,A,B
[Figure: execution using one pipelined functional unit vs. four-lane execution using four pipelined functional units; one unit completes one element per cycle (C[0], C[1], C[2], …), while four lanes complete four elements per cycle (C[0]–C[3], then C[4]–C[7], …)]

Vector Unit Structure
[Figure: a vector unit built from four lanes; each lane contains a slice of the vector registers, a pipelined functional unit, and a port to the memory subsystem; lane 0 holds elements 0, 4, 8, …, lane 1 holds elements 1, 5, 9, …, lane 2 holds elements 2, 6, 10, …, lane 3 holds elements 3, 7, 11, …]
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
  – Allows for multiple hardware lanes
  – No communication between lanes
  – Little increase in control overhead
  – No need to change machine code
Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
[Figure: scalar sequential code performs load, load, add, store once per iteration (Iter. 1, Iter. 2, …), while vectorized code replaces each group of scalar operations across iterations with a single vector instruction (vector load, vector load, vector add, vector store)]
• Vectorization is a massive compile-time reordering of operation sequencing; requires extensive loop dependence analysis
Vector Execution Time
• Execution time depends on three factors:
  – Length of operand vectors
  – Structural hazards
  – Data dependencies
• VMIPS functional units consume one element per clock cycle
  – Execution time is approximately the vector length
• Convoy
  – Set of vector instructions that could potentially execute together
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes
[Figure: the load unit, multiply unit, and add unit each work on a different vector instruction in the same cycle; complete 24 operations/cycle while issuing 1 short instruction/cycle]
Convoy
• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
Vector Chaining
• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
  LV    V1
  MULV  V3, V1, V2
  ADDV  V5, V3, V4
[Figure: results from the load unit are chained into the multiplier (V1 -> V3), and multiplier results are chained into the adder (V3 -> V5), element by element]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting a dependent instruction
• With chaining, can start a dependent instruction as soon as the first result appears
[Figure: timeline comparing Load -> Mul -> Add without chaining (serialized) and with chaining (overlapped)]
Chimes
• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For vector length of n, requires m × n clock cycles
Example
  LV       V1,Rx     ; load vector X
  MULVS.D  V2,V1,F0  ; vector-scalar multiply
  LV       V3,Ry     ; load vector Y
  ADDVV.D  V4,V2,V3  ; add two vectors
  SV       Ry,V4     ; store the sum
Convoys:
  1. LV, MULVS.D
  2. LV, ADDVV.D
  3. SV
3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
Vector Strip-mining
Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers: "strip-mining"
  for (i=0; i<N; i++)
      C[i] = A[i]+B[i];
  ANDI   R1, N, 63    ; N mod 64
  MTC1   VLR, R1      ; Do remainder
loop:
  LV     V1, RA
  DSLL   R2, R1, 3    ; Multiply by 8
  DADDU  RA, RA, R2   ; Bump pointer
  LV     V2, RB
  DADDU  RB, RB, R2
  ADDV.D V3, V1, V2
  SV     V3, RC
  DADDU  RC, RC, R2
  DSUBU  N, N, R1     ; Subtract elements
  LI     R1, 64
  MTC1   VLR, R1      ; Reset full length
  BGTZ   N, loop      ; Any more to do?
[Figure: A, B, and C partitioned into an odd-size remainder piece followed by full 64-element pieces]
Vector Length Register
• Handling loops with length not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:
  low = 0;
  VL = (n % MVL);                       /* find odd-size piece using modulo op */
  for (j = 0; j <= (n/MVL); j=j+1) {    /* outer loop */
      for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
          Y[i] = a * X[i] + Y[i];       /* main operation */
      low = low + VL;                   /* start of next vector */
      VL = MVL;                         /* reset the length to maximum vector length */
  }
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
Vector-Mask Control
[Figure: two implementations of masked vector execution with mask M[1]=M[4]=M[5]=M[7]=1 and the other bits 0]
• Simple implementation
  – Execute all N operations; turn off result writeback according to the mask (write disable on the write data port)
• Density-time implementation
  – Scan the mask vector and only execute elements with non-zero masks
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor
• Consider:
  for (i = 0; i < 64; i=i+1)
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
  LV       V1,Rx     ; load vector X into V1
  LV       V2,Ry     ; load vector Y
  LD       F0,#0     ; load FP zero into F0
  SNEVS.D  V1,F0     ; sets VM(i) to 1 if V1(i)!=F0
  SUBVV.D  V1,V1,V2  ; subtract under vector mask
  SV       Rx,V1     ; store the result in X
• GFLOPS rate decreases!
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – Population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M[1]=M[4]=M[5]=M[7]=1, compress packs A[1], A[4], A[5], A[7] into the front of the destination; expand scatters them back to positions 1, 4, 5, 7, leaving the B values where the mask is 0]
• Used for density-time conditionals and also for general selection operations
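A scalar C sketch of the compress and expand semantics (illustrative, not vector ISA code):

  /* Compress: pack elements of a selected by mask into the front of dst.
     Returns the packed length (the population count of the mask). */
  int compress(int n, const double *a, const int *mask, double *dst) {
      int len = 0;
      for (int i = 0; i < n; i++)
          if (mask[i])
              dst[len++] = a[i];
      return len;
  }

  /* Expand: inverse operation; place packed src values at masked positions,
     keeping the existing dst values where the mask is zero. */
  void expand(int n, const double *src, const int *mask, double *dst) {
      int k = 0;
      for (int i = 0; i < n; i++)
          if (mask[i])
              dst[i] = src[k++];
  }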
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non-sequential loads or stores
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. An SRAM access takes 15/2.167 ≈ 6.92 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth
The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth)
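A quick C check of the arithmetic above; the ceiling to whole processor cycles follows the slide:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      int refs_per_cycle = 32 * (4 + 2);      /* 192 references per processor cycle */
      double busy = ceil(15.0 / 2.167);       /* SRAM busy time ~= 7 processor cycles */
      printf("banks needed: %.0f\n", refs_per_cycle * busy);  /* 1344 */
      return 0;
  }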
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
Interleaved Memory Layout
[Figure: a vector processor connected to eight unpipelined DRAM banks, where bank i holds the addresses with Addr mod 8 = i]
• Great for unit stride:
  – Contiguous elements are in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – Above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4
Handling Multidimensional Arrays in Vector Architectures
• Consider:
  for (i = 0; i < 100; i=i+1)
      for (j = 0; j < 100; j=j+1) {
          A[i][j] = 0.0;
          for (k = 0; k < 100; k=k+1)
              A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it has logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit-stride data
  – Blocking techniques help non-unit-stride data
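To see where the stride of 100 comes from, a small C sketch of row-major address arithmetic (the helper function is illustrative):

  /* Row-major layout: D[k][j] lives at offset k*100 + j (in double words)
     from the base of D. Walking down a column (fixed j, k = 0, 1, 2, ...)
     therefore touches every 100th double word: a stride of 100. */
  double *element_addr(double *D_base, int k, int j) {
      return D_base + (k * 100 + j);
  }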
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes × word)/clock
• # memory banks > memory latency to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are good to support more strides at full bandwidth
  – Can read a paper on how to do a prime number of banks efficiently
Memory Bank Conflicts
  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles to initialize
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
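The same timing model as a C sketch, under the slide's assumptions (the first access pays full latency; a stride that is a multiple of the number of banks serializes every later access on one busy bank):

  #include <stdio.h>

  /* Approximate cycles for an n-element vector load under the slide's model. */
  int load_cycles(int n, int stride, int banks, int busy_time, int latency) {
      int conflict = (stride % banks == 0);  /* worst case: all accesses hit one bank */
      if (!conflict)
          return latency + n;                    /* 12 + 64 = 76 for stride 1 */
      return latency + 1 + busy_time * (n - 1);  /* 12 + 1 + 6*63 = 391 for stride 32 */
  }

  int main(void) {
      printf("stride 1:  %d cycles\n", load_cycles(64, 1, 8, 6, 12));
      printf("stride 32: %d cycles\n", load_cycles(64, 32, 8, 6, 12));
      return 0;
  }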
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
Scatter-Gather
• Consider:
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
  LV       Vk, Rk       ; load K
  LVI      Va, (Ra+Vk)  ; load A[K[]]
  LV       Vm, Rm       ; load M
  LVI      Vc, (Rc+Vm)  ; load C[M[]]
  ADDVV.D  Va, Va, Vc   ; add them
  SVI      (Ra+Vk), Va  ; store A[K[]]
• A and C must have the same number of non-zero elements (the sizes of K and M)
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., …
  – use in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY
  LD      F0,a        ; load scalar a
  MOV     F1,F0       ; copy a into F1 for SIMD MUL
  MOV     F2,F0       ; copy a into F2 for SIMD MUL
  MOV     F3,F0       ; copy a into F3 for SIMD MUL
  DADDIU  R4,Rx,#512  ; last address to load
Loop:
  L.4D    F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
  MUL.4D  F4,F4,F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
  L.4D    F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
  ADD.4D  F8,F8,F4    ; a×X[i]+Y[i], …, a×X[i+3]+Y[i+3]
  S.4D    0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
  DADDIU  Rx,Rx,#32   ; increment index to X
  DADDIU  Ry,Ry,#32   ; increment index to Y
  DSUBU   R20,R4,Rx   ; compute bound
  BNEZ    R20,Loop    ; check if done
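For comparison, a hedged sketch of the same DAXPY loop using x86 SSE2 intrinsics (two doubles per 128-bit register); it assumes n is a multiple of 2 and that the pointers are 16-byte aligned:

  #include <emmintrin.h>  /* SSE2 intrinsics */

  void daxpy_sse2(int n, double a, const double *x, double *y) {
      __m128d va = _mm_set1_pd(a);                /* broadcast a into both lanes */
      for (int i = 0; i < n; i += 2) {
          __m128d vx = _mm_load_pd(&x[i]);        /* load X[i], X[i+1] */
          __m128d vy = _mm_load_pd(&y[i]);        /* load Y[i], Y[i+1] */
          vy = _mm_add_pd(_mm_mul_pd(va, vx), vy); /* a*X + Y, two lanes at once */
          _mm_store_pd(&y[i], vy);                /* store Y[i], Y[i+1] */
      }
  }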
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
[Figure: roofline plots for example machines; the sloped memory-bandwidth limit meets the flat compute roof at the ridge point]
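A minimal C sketch of this min() formula with hypothetical machine numbers (not from the slides):

  #include <stdio.h>

  /* Roofline: attainable performance is capped by either the compute roof
     or the memory-bandwidth slope, whichever is lower at a given intensity. */
  double attainable_gflops(double peak_gflops, double peak_bw_gbs, double intensity) {
      double bw_limited = peak_bw_gbs * intensity;  /* GFLOP/s when memory-bound */
      return bw_limited < peak_gflops ? bw_limited : peak_gflops;
  }

  int main(void) {
      /* Hypothetical machine: 100 GFLOP/s compute roof, 25 GB/s memory bandwidth. */
      for (double ai = 0.5; ai <= 8.0; ai *= 2)
          printf("AI=%.1f FLOP/byte -> %.1f GFLOP/s\n",
                 ai, attainable_gflops(100.0, 25.0, ai));
      return 0;
  }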
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread" (SIMT)
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with a vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: memory hierarchy of the GTX570: 1280MB global memory; 640KB L2 cache; per-SM 8KB texture cache, 8KB constant cache, 48KB shared memory, and 16KB L1 cache; 15 SMs (SM 0 … SM 14), each with 32,768 registers and 32 cores; up to 1536 threads/SM]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
Programming the GPU
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
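As a concrete sketch of the scheme just described (one thread per C element, sub-matrices buffered in shared memory), here is an illustrative CUDA kernel; the TILE width, kernel name, and launch configuration are assumptions, not from the lecture, and N must be a multiple of TILE:

  #define TILE 16  /* illustrative tile width */

  __global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
      /* Each thread computes one element of C, staging TILE x TILE
         sub-matrices of A and B in shared memory, as the slide describes. */
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;

      for (int t = 0; t < N / TILE; t++) {
          As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
          Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
          __syncthreads();               /* wait until the tile is fully loaded */
          for (int k = 0; k < TILE; k++)
              acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();               /* wait before overwriting the tile */
      }
      C[row * N + col] = acc;
  }

  /* Host-side launch, one thread per C element, e.g., for N = 4096:
       dim3 block(TILE, TILE);
       dim3 grid(N / TILE, N / TILE);
       matmul_tiled<<<grid, block>>>(dA, dB, dC, N);                          */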
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared are allocated to the GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block
CUDA Program Example
  /* Invoke DAXPY */
  daxpy(n, 2.0, x, y);
  /* DAXPY in C */
  void daxpy(int n, double a, double *x, double *y) {
      for (int i = 0; i < n; i++)
          y[i] = a*x[i] + y[i];
  }

  /* Invoke DAXPY with 256 threads per Thread Block */
  __host__
  int nblocks = (n + 255) / 256;
  daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
  /* DAXPY in CUDA; the kernel must be __global__ to be launched with <<<...>>> */
  __global__
  void daxpy(int n, double a, double *x, double *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a*x[i] + y[i];
  }
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
  shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
  add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
  ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
  ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
  mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
  if (X[i] != 0)
      X[i] = X[i] - Y[i];
  else
      X[i] = Z[i];

  ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
  setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                ; if P1 false, go to ELSE1
  ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
  sub.f64        RD0, RD0, RD2  ; Difference in RD0
  st.global.f64  [X+R8], RD0    ; X[i] = RD0
  @P1, bra ENDIF1, *Comp        ; complement mask bits
                                ; if P1 true, go to ENDIF1
ELSE1:
  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
  st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1:
  <next instruction>, *Pop      ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU memory
  – Host can read and write GPU memory
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Challenges of Vector Instructions
bull Start-up time
ndash Application and architecture must support long vectors; otherwise they will run out of instructions requiring ILP
ndash Latency of vector functional unit
ndash Assume the same as Cray-1:
bull Floating-point add => 6 clock cycles
bull Floating-point multiply => 7 clock cycles
bull Floating-point divide => 20 clock cycles
bull Vector load => 12 clock cycles
CA-Lec8 cwliutwinseenctuedutw 16
Vector Arithmetic Execution
bull Use deep pipeline (=> fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=> no hazards)
[Figure: six-stage multiply pipeline computing V3 <- V1 * V2]
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV C, A, B
[Figure: execution using one pipelined functional unit (one element per cycle: C[0], C[1], C[2], ...) vs. four-lane execution using four pipelined functional units (four elements per cycle)]
Vector Unit Structure
[Figure: vector unit structure: four lanes, each with its own functional-unit pipelines and a slice of the vector register file (lane n holds elements n, n+4, n+8, ...); all lanes share the memory subsystem]
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecture
bull Beyond one element per clock cycle
bull Element n of vector register A is hardwired to element n of vector register B
ndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: scalar sequential code: each iteration issues load, load, add, store in order; vectorized code: a single vector load, load, add, store sequence covers all iterations]
Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis
Vector Execution Time
bull Execution time depends on three factors
ndash Length of operand vectors
ndash Structural hazards
ndash Data dependences
bull VMIPS functional units consume one element per clock cycle
ndash Execution time is approximately the vector length
bull Convoy
ndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw 22
Vector Instruction Parallelism
bull Can overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
[Figure: load, multiply, and add units each processing overlapped vector instructions over time]
bull Complete 24 operations/cycle while issuing 1 short instruction/cycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy: set of vector instructions that could potentially execute together
ndash Must not contain structural hazards
ndash Sequences with RAW hazards should be in different convoys
bull However, sequences with RAW hazards can be in the same convoy via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chaining
bull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
[Figure: load unit fills V1, which chains into the multiplier (V3 <- V1 * V2), whose results chain into the adder (V5 <- V3 + V4)]
LV    V1
MULV  V3, V1, V2
ADDV  V5, V3, V4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
[Figure: timelines: with chaining, the dependent Add starts as soon as the first Load/Mul result appears; without chaining, it waits for the last]
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chime: unit of time to execute one convoy
ndash m convoys execute in m chimes
ndash For a vector length of n, requires m x n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
Example
LV       V1, Rx      ; load vector X
MULVS.D  V2, V1, F0  ; vector-scalar multiply
LV       V3, Ry      ; load vector Y
ADDVV.D  V4, V2, V3  ; add two vectors
SV       Ry, V4      ; store the sum
Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV
3 chimes; 2 FP ops per result; cycles per FLOP = 1.5
For 64-element vectors, requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw 28
Vector Strip‐mining
for (i=0; i<N; i++)
    C[i] = A[i] + B[i];
Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers ("strip-mining")
      ANDI   R1, N, 63    ; N mod 64
      MTC1   VLR, R1      ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    ; Multiply by 8
      DADDU  RA, RA, R2   ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      ; Reset full length
      BGTZ   N, loop      ; Any more to do?
[Figure: arrays A + B = C processed as 64-element blocks plus an initial remainder block]
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64
bull Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length: strip-mining
low = 0;
VL = (n % MVL);                        /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop */
    for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
        Y[i] = a * X[i] + Y[i];        /* main operation */
    low = low + VL;                    /* start of next vector */
    VL = MVL;                          /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determines the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVL
ndash Utilize the full power of the vector processor
bull Later generations may grow the MVL
ndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
Simple implementation
ndash execute all N operations, turn off result writeback according to mask
Density-time implementation
ndash scan mask vector and only execute elements with non-zero masks
[Figure: masked vector execution: the simple scheme write-disables elements with M[i]=0; the density-time scheme processes only the elements with M[i]=1]
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural state
bull For a vector processor, it relies on compilers to manipulate VMR explicitly
bull For a GPU, it gets the same effect using hardware
ndash Invisible to SW
bull Both GPUs and vector processors spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
bull Programs that contain IF statements in loops cannot be run in a vector processor using the techniques so far
bull Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
bull Use the vector mask register to "disable" elements:
LV       V1, Rx     ; load vector X into V1
LV       V2, Ry     ; load vector Y
L.D      F0, #0     ; load FP zero into F0
SNEVS.D  V1, F0     ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D  V1, V1, V2 ; subtract under vector mask
SV       Rx, V1     ; store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
Compress/Expand Operations
bull Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
ndash Population count of the mask vector gives the packed vector length
bull Expand performs the inverse operation
[Figure: compress selects the elements whose mask bit is 1 (A[1], A[4], A[5], A[7]) and packs them to the start of the destination; expand scatters them back to the masked positions]
Used for density-time conditionals and also for general selection operations
CA-Lec8 cwliutwinseenctuedutw 35
Memory Banks
bull The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
ndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start-ups on load/store units are higher than those for arithmetic units
bull The memory system must be designed to support high bandwidth for vector loads and stores
ndash Spread accesses across multiple banks
1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
bull 32 processors, each generating 4 loads and 2 stores per cycle
bull Processor cycle time is 2.167 ns
bull SRAM cycle time is 15 ns
bull How many memory banks are needed?
bull Solution:
1. The maximum number of memory references each cycle is 32 x 6 = 192
2. SRAM takes 15/2.167 = 6.92, i.e., about 7 processor cycles
3. It requires 192 x 7 = 1344 memory banks at full memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth)
Memory Addressing
bull Load/store operations move groups of data between vector registers and memory
bull Three types of addressing
ndash Unit stride
bull Contiguous block of information in memory
bull Fastest; always possible to optimize this
ndash Non-unit (constant) stride
bull Harder to optimize the memory system for all possible strides
bull A prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather-scatter)
bull Vector equivalent of register indirect
bull Good for sparse arrays of data
bull Increases the number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride:
ndash Contiguous elements in different DRAMs
ndash Startup time for a vector operation is the latency of a single read
bull What about non-unit stride?
ndash Above good for strides that are relatively prime to 8
ndash Bad for 2, 4
[Figure: vector processor fed by eight unpipelined DRAM banks; addresses are interleaved so that bank i holds words with Addr mod 8 = i]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of D
ndash Elements of B accessed in row-major order, but elements of D in column-major order
bull Once a vector is loaded into a register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
bull The distance separating elements to be gathered into a single register is called the stride
bull In the example:
ndash Matrix D has a stride of 100 double words
ndash Matrix B has a stride of 1 double word
bull Use non-unit stride for D
ndash To access non-sequential memory locations and to reshape them into a dense structure
bull The size of the matrix may not be known at compile time
ndash Use LVWS/SVWS: load/store vector with stride
bull The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
bull Cache inherently deals with unit-stride data
ndash Blocking techniques help non-unit-stride data
CA-Lec8 cwliutwinseenctuedutw 41
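To make the stride concrete, here is a sketch in VMIPS-style code of loading one column of D and one row of B (register assignments are mine, not from the lecture; LVWS takes the stride from a general-purpose register):
DADDIU R3, R0, #800     ; stride for D: 100 doublewords x 8 bytes
LVWS   V1, (R1, R3)     ; load a column of D: elements at R1 + i*800
LV     V2, R2           ; unit-stride load of a row of B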
How to Get Full Bandwidth for Unit Stride
bull Memory system must sustain (# lanes x word) / clock
bull # memory banks > memory latency to avoid stalls
bull If desired throughput is greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)
ndash Or wider DRAMs; only good for unit stride or large data types
bull More banks are also good for supporting more strides at full bandwidth
ndash (can read paper on how to do a prime number of banks efficiently)
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
bull Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
bull SW: loop interchange, or declare the array not a power of 2 ("array padding")
bull HW: prime number of banks
ndash bank number = address mod number of banks
ndash address within bank = address / number of words in bank
ndash modulo & divide per memory access with a prime number of banks?
ndash address within bank = address mod number of words in bank
ndash bank number easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
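A sketch of the "array padding" fix mentioned above (the pad width is my choice; any row length relatively prime to the number of banks works):
/* 513 words per row instead of 512: consecutive x[i][j], x[i+1][j] are now
   513 words apart, and gcd(513, 128) = 1, so a column walk cycles through
   all 128 banks instead of hammering one. Column x[i][512] is never used. */
int x[256][513];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];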
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
bull 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for the initial access
bull What is the difference between a 64-element vector load with a stride of 1 and of 32?
bull Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12+64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
ndash the load will take 12+1+6*63 = 391 cycles, i.e., 6.1 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
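The example's numbers follow from a simple timing model: an access stalls whenever its bank is still busy. A minimal C sketch of that model (the model and names are mine; it reproduces the 76- and 391-cycle results above):
#include <stdio.h>

/* Element i of a strided load goes to bank (i*stride) mod banks; an access
   must wait until its bank has been idle for `busy` cycles. */
static int vector_load_cycles(int n, int stride, int banks, int busy, int latency)
{
    int next_free[64] = {0};      /* cycle at which each bank becomes free */
    int t = latency;              /* first element arrives after full latency */
    for (int i = 0; i < n; i++) {
        int b = (int)((long)i * stride % banks);
        if (t < next_free[b]) t = next_free[b];  /* stall on bank conflict */
        next_free[b] = t + busy;
        t = t + 1;                /* otherwise one element per clock */
    }
    return t;
}

int main(void)
{
    printf("stride 1:  %d cycles\n", vector_load_cycles(64, 1, 8, 6, 12));   /* 76  */
    printf("stride 32: %d cycles\n", vector_load_cycles(64, 32, 8, 6, 12));  /* 391 */
    return 0;
}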
Handling Sparse Matrices in Vector Architecture
bull Supporting sparse matrices in vector mode is a necessity
bull Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
ndash where K and M are index vectors designating the nonzero elements of A and C
bull Sparse matrix elements are stored in a compact form and accessed indirectly
bull Gather-scatter operations are used
ndash Using LVI/SVI: load/store vector indexed ("gather"/"scatter")
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
bull Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
bull Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
bull Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]]
bull A and C must have the same number of non-zero elements (sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summary
bull Vector is an alternative model for exploiting ILP
ndash If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
bull More lanes, slower clock rate
ndash Scalable if elements are independent
ndash If there is a dependency: one stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compiler
bull Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
bull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
bull Compilers can provide feedback to programmers
bull Programmers can provide hints to the compiler
bull Cray Y-MP benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
bull Very short vectors added to existing ISAs
bull Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
bull Newer designs have 128-bit registers (Altivec, SSE2)
bull Limited instruction set
ndash no vector length control
ndash no load/store stride or scatter/gather
ndash unit-stride loads must be aligned to a 64/128-bit boundary
bull Limited vector register length
ndash requires superscalar dispatch to keep multiply/add/load units busy
ndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia
bull Intel MMX: 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860, Mot. 88110, HP PA-7100LC, UltraSPARC
bull 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
ndash reuse 8 FP registers (FP and MMX cannot mix)
bull short vector: load, add, store 8 8-bit operands
bull Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
ndash use in drivers or added to library routines; no compiler
[Figure: eight 8-bit adds performed in parallel within a 64-bit register]
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
bull Move: 32b, 64b
bull Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
ndash opt. signed/unsigned saturate (set to max) if overflow
bull Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
bull Multiply, Multiply-Add in parallel: 4 16b
bull Compare =, > in parallel: 8 8b, 4 16b, 2 32b
ndash sets field to 0s (false) or 1s (true); removes branches
bull Pack/Unpack
ndash Convert 32b <-> 16b, 16b <-> 8b
ndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
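These operations are exposed to C programmers as compiler intrinsics; here is a minimal sketch using the saturating unsigned add (intrinsic names are from Intel's mmintrin.h; compile with -mmmx on GCC/Clang):
#include <stdio.h>
#include <mmintrin.h>   /* MMX intrinsics */

int main(void)
{
    unsigned char a[8] = {200, 10, 20, 30, 40, 50, 60, 70};
    unsigned char b[8] = {100,  1,  2,  3,  4,  5,  6,  7};
    unsigned char c[8];

    __m64 va = *(const __m64 *)a;    /* pack 8 8-bit operands into 64-bit regs */
    __m64 vb = *(const __m64 *)b;
    __m64 vc = _mm_adds_pu8(va, vb); /* parallel unsigned-saturating add:
                                        200 + 100 saturates to 255 */
    *(__m64 *)c = vc;
    _mm_empty();                     /* clear MMX state (FP and MMX cannot mix) */

    for (int i = 0; i < 8; i++)
        printf("%u ", c[i]);
    printf("\n");
    return 0;
}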
SIMD Implementations: IA32/AMD64
bull Intel MMX (1996)
ndash Repurposed the 64-bit floating-point registers
ndash Eight 8-bit integer ops or four 16-bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)
ndash Separate 128-bit registers
ndash Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
ndash Single-precision floating-point arithmetic
bull SSE2 (2001), SSE3 (2004), SSE4 (2007)
ndash Double-precision floating-point arithmetic
bull Advanced Vector Extensions (2010)
ndash 256-bit registers
ndash Four 64-bit integer/fp ops
ndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
bull VMX (1996-1998)
ndash 32 4-bit, 16 8-bit, 8 16-bit, or 4 32-bit integer ops, and 4 32-bit FP ops
ndash Data rearrangement
bull Cell SPE (PS3)
ndash 16 8-bit, 8 16-bit, or 4 32-bit integer ops, and 4 32-bit and 8 64-bit FP ops
ndash Unified vector/scalar execution with 128 registers
bull VMX 128 (Xbox360)
ndash Extension to 128 registers
bull VSX (2009)
ndash 1 or 2 64-bit FPU, 4 32-bit FPU
ndash Integrates FPU and VMX into a unit with 64 registers
bull QPX (2010, Blue Gene)
ndash Four 64-bit SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unit
bull Easy to implement
bull Needs smaller memory bandwidth than vector
bull Separate data transfers aligned in memory
ndash Vector: a single instruction means 64 memory accesses, and a page fault in the middle of the vector is likely
bull Uses much smaller register space
bull Fewer operands
bull No need for the sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
bull Example: DAXPY
        L.D     F0, a         ; load scalar a
        MOV     F1, F0        ; copy a into F1 for SIMD MUL
        MOV     F2, F0        ; copy a into F2 for SIMD MUL
        MOV     F3, F0        ; copy a into F3 for SIMD MUL
        DADDIU  R4, Rx, #512  ; last address to load
Loop:   L.4D    F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4, F4, F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
        L.4D    F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8, F8, F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
        S.4D    0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx, Rx, #32   ; increment index to X
        DADDIU  Ry, Ry, #32   ; increment index to Y
        DSUBU   R20, R4, Rx   ; compute bound
        BNEZ    R20, Loop     ; check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
bull Scalar processor memory architecture
ndash Only access to contiguous data
bull No efficient scatter/gather accesses
ndash Significant penalty for unaligned memory access
ndash May need to write entire vector register
bull Limitations on data access patterns
ndash Limited by cache line, up to 128-256b
bull Conditional execution
ndash Register renaming does not work well with masked execution
ndash Always need to write the whole register
ndash Difficult to know when to indicate exceptions
bull Register pressure
ndash Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
bull Basic idea
ndash Plot peak floating-point throughput as a function of arithmetic intensity
ndash Ties together floating-point performance and memory performance for a target machine
bull Arithmetic intensity
ndash Floating-point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPs/sec
[Figure: roofline plots of attainable GFLOPs/sec vs. arithmetic intensity for example machines]
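The bound plotted in these figures is the standard roofline formula, which caps performance by whichever of the two limits is lower:
Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)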
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cards
ndash Frame buffer memory with address generation for video output
bull 3D graphics processing
ndash Originally high-end computers (e.g., SGI)
ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Units
ndash Processors oriented to 3D graphics tasks
ndash Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic idea
ndash Heterogeneous execution model
bull CPU is the host, GPU is the device
ndash Develop a C-like programming language for the GPU
ndash Unify all forms of GPU parallelism as the CUDA thread
ndash Programming model is "Single Instruction Multiple Thread" (SIMT)
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
bull CUDA's design goals
ndash extend a standard sequential programming language, specifically C/C++
bull focus on the important issues of parallelism, i.e., how to craft efficient parallel algorithms, rather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelism
bull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machines
ndash Works well with data-level parallel problems
ndash Scatter-gather transfers from memory into local store
ndash Mask registers
ndash Large register files
bull Differences
ndash No scalar processor; scalar integration
ndash Uses multithreading to hide memory latency
ndash Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
bull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)
bull A thread is associated with each data element
bull Threads are organized into blocks
bull Blocks are organized into a grid
bull GPU hardware handles thread management, not applications or OS
ndash Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
bull Multiply two vectors of length 8192
ndash Code that works over all elements is the grid
ndash Thread blocks break this down into manageable sizes
bull 512 threads per block
ndash SIMD instruction executes 32 elements at a time
ndash Thus grid size = 16 blocks
ndash Block is analogous to a strip-mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
bull Threads of SIMD instructions
ndash Each has its own PC
ndash Thread scheduler uses a scoreboard to dispatch
ndash No data dependences between threads
ndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latency
bull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processor
ndash 32 SIMD lanes
ndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
bull NVIDIA GPU has 32768 registers
ndash Divided into lanes
ndash Each SIMD thread is limited to 64 registers
ndash A SIMD thread has up to:
bull 64 vector registers of 32 32-bit elements
bull 32 vector registers of 32 64-bit elements
ndash Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
[Figure: memory hierarchy: global memory 1280MB; L2 cache 640KB; per SM (SM 0 ... SM 14): 32 cores, 32768 registers, shared memory 48KB, L1 cache 16KB, texture cache 8KB, constant cache 8KB; up to 1536 threads/SM]
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively: memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
bull For a 4096x4096 matrix multiplication
ndash Matrix C will require calculation of 16,777,216 matrix cells
bull On the GPU, each cell is calculated by its own thread
bull We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
bull On a general-purpose processor we can only calculate one cell at a time
bull Each thread exploits the GPU's fine granularity by computing one element of Matrix C
bull Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
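A minimal sketch of such a kernel (tile size, names, and launch configuration are my assumptions, not from the lecture; n is assumed to be a multiple of TILE). Each thread computes one element of C, and each block stages TILE x TILE sub-matrices of A and B in shared memory:
#define TILE 16

__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               /* wait until the tile is fully loaded */
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* wait before overwriting the tile */
    }
    C[row * n + col] = sum;
}

/* Launch: one thread per element of C, e.g. for n = 4096:
   dim3 block(TILE, TILE);            // 256 threads per block
   dim3 grid(n / TILE, n / TILE);
   matmul<<<grid, block>>>(dA, dB, dC, n);                                   */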
Programming the GPU
bull Distinguishing execution place of functions:
ndash __device__ or __global__ => GPU device; variables declared are allocated to the GPU memory
ndash __host__ => system processor (HOST)
bull Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
ndash blockIdx: block identifier
ndash threadIdx: thread identifier within a block
ndash blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
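The slide omits the host-side setup; a minimal sketch using the standard CUDA runtime calls (d_x and d_y are names introduced here; the kernel must be declared __global__ to be launchable):
double *d_x, *d_y;                         /* device copies of x and y */
cudaMalloc((void **)&d_x, n * sizeof(double));
cudaMalloc((void **)&d_y, n * sizeof(double));
cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y); /* run the kernel on the device */
cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);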
NVIDIA Instruction Set Arch.
bull "Parallel Thread Execution (PTX)"
bull Uses virtual registers
bull Translation to machine code is performed in software
bull Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures, GPU branch hardware uses internal masks
bull Also uses:
ndash Branch synchronization stack
bull Entries consist of masks for each SIMD lane
bull i.e., which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash ...and when paths converge
bull Act as barriers
bull Pops stack
bull Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

      ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
      setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
@!P1, bra ELSE1, *Push             ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
      ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
      sub.f64       RD0, RD0, RD2  ; Difference in RD0
      st.global.f64 [X+R8], RD0    ; X[i] = RD0
@P1,  bra ENDIF1, *Comp            ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
      st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
bull Each SIMD lane has a private section of off-chip DRAM
ndash "Private memory"
ndash Contains stack frame, spilled registers, and private variables
bull Each multithreaded SIMD processor also has local memory
ndash Shared by SIMD lanes / threads within a block
bull Memory shared by SIMD processors is GPU Memory
ndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
bull Each SIMD processor has
ndash Two SIMD thread schedulers, two instruction dispatch units
ndash 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
ndash Thus, two threads of SIMD instructions are scheduled every two clock cycles
bull Fast double precision
bull Caches for GPU memory
bull 64-bit addressing and unified address space
bull Error correcting codes
bull Faster context switching
bull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
[Figure: block diagram of a Fermi multithreaded SIMD processor]
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop-Level Parallelism
bull Loop-carried dependence
ndash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop-level parallelism has no loop-carried dependence
bull Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
bull No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop-Level Parallelism
bull Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
bull S1 and S2 each use a value computed in the previous iteration (A[i] by S1, B[i] by S2)
ndash Loop-carried dependence
bull S2 uses the value A[i+1] computed by S1 in the same iteration
ndash This dependence is not loop-carried
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
bull Intra-loop dependence is not loop-carried dependence
ndash A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
bull Two types of S1-S2 intra-loop dependence
ndash Circular: S1 depends on S2 and S2 depends on S1
ndash Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependences
ndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop-Level Parallelism (1)
bull Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
bull S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
bull There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop-Level Parallelism (2)
bull Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];     /* S2 */
    A[i+1] = A[i+1] + B[i+1]; /* S1 */
}
B[100] = C[99] + D[99];
bull The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
bull The second reference to A need not be translated to a load instruction
ndash The two references are the same; there is no intervening memory access to the same location
bull A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop-carried dependence
bull Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
bull Detecting a recurrence is important
ndash Some vector computers have special support for executing recurrences
ndash It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine
ndash A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
ndash The index of a multi-dimensional array is affine if the index in each dimension is affine
ndash Non-affine access example: x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependences: Example
bull Assume:
ndash Load an array element with index c*i + d, and store to a*i + b
ndash i runs from m to n
bull A dependence exists if the following two conditions hold:
1. There exist j, k such that m <= j <= n and m <= k <= n
2. a*j + b = c*k + d
bull In general, the values of a, b, c, and d are not known at compile time
ndash Dependence testing is expensive but decidable
ndash GCD (greatest common divisor) test:
bull If a loop-carried dependence exists, then GCD(c,a) divides (d-b)
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
bull Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a,c)=2, |d-b|=3
3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
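A minimal C sketch of the GCD test (helper names are mine, not from the lecture):
#include <stdio.h>

/* Euclid's algorithm */
static int gcd(int a, int b)
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* For a store to a*i + b and a load from c*i + d: if GCD(c, a) does not
   divide (d - b), no loop-carried dependence is possible; if it does,
   a dependence MAY exist (the test ignores the loop bounds). */
static int may_depend(int a, int b, int c, int d)
{
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % gcd(c, a) == 0;
}

int main(void)
{
    /* X[2*i+3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
    printf("%s\n", may_depend(2, 3, 2, 0) ? "dependence possible"
                                          : "no dependence");
    return 0;
}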
Remarks
bull The GCD test is sufficient but not necessary
ndash The GCD test does not consider the loop bounds
ndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding Dependences: Example 2
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
bull True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
bull Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
bull Output dependence: S1->S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
bull Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
bull After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
bull The loop is not parallel because it has a loop-carried dependence
bull Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is "scalar expansion" (scalar ===> vector); it is parallel.
The second loop is a "reduction": it sums up the elements of the vector and is not parallel.
Reduction
bull Reductions are common in linear algebra algorithms
bull Reductions can be handled by special hardware in a vector and SIMD architecture
ndash Similar to what can be done in a multiprocessor environment
bull Example: to sum up 1000 elements on each of ten processors
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicit parallelism (DLP or TLP) is the next step to performance
bull Coarse-grained vs. fine-grained multithreading
ndash Switch only on a big stall vs. switch every clock cycle
bull Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
ndash Instead of replicating registers, reuse rename registers
bull Vector is an alternative model for exploiting ILP
ndash If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
ndash Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
V1 V2 V3
V3 lt- v1 v2
Six stage multiply pipeline
Vector Arithmetic Execution
bull Use deep pipeline (=gt fast clock) to execute element operations
bull Simplifies control of deep pipeline because elements in vector are independent (=gt no hazards)
CA-Lec8 cwliutwinseenctuedutw 17
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes
Complete 24 operations/cycle while issuing 1 short instruction/cycle
[Figure: timeline showing the load unit, multiply unit, and add unit each executing overlapping vector instructions (load, load/mul, mul/add, add)]
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chaining
• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available

LV    v1
MULV  v3, v1, v2
ADDV  v5, v3, v4

[Figure: elements from the load unit chain into V1 as they feed the multiplier (V1 * V2 -> V3), whose results in turn chain into the adder (V3 + V4 -> V5)]
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: timeline comparing Load -> Mul -> Add without chaining (serialized) and with chaining (overlapped)]
CA-Lec8 cwliutwinseenctuedutw 26
Chimes
• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For vector length of n, requires m x n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
Example
LV       V1,Rx      ;load vector X
MULVS.D  V2,V1,F0   ;vector-scalar multiply
LV       V3,Ry      ;load vector Y
ADDVV.D  V4,V2,V3   ;add two vectors
SV       Ry,V4      ;store the sum
Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV
3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw 28
Vector Strip-mining
Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers ("strip-mining")

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63     ; N mod 64
      MTC1   VLR, R1       ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     ; Multiply by 8
      DADDU  RA, RA, R2    ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       ; Reset full length
      BGTZ   N, loop       ; Any more to do?

[Figure: arrays A, B, C processed in 64-element pieces, with the odd-size remainder piece handled first]
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
• Handles loops whose length is not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length; strip mining:

low = 0;
VL = (n % MVL);                        /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop */
    for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
        Y[i] = a * X[i] + Y[i];        /* main operation */
    low = low + VL;                    /* start of next vector */
    VL = MVL;                          /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector-Mask Control
Simple implementation
  – execute all N operations, turn off result writeback according to mask (per-element write disable)
Density-time implementation
  – scan the mask vector and only execute elements with non-zero masks
[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, the simple implementation computes every A[i]+B[i] and write-disables the masked-off slots, while the density-time implementation executes only elements 1, 4, 5, and 7]
CA-Lec8 cwliutwinseenctuedutw 32
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in a vector processor without special support
• Consider:
  for (i = 0; i < 64; i=i+1)
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements (sketched in C below):
  LV       V1,Rx    ;load vector X into V1
  LV       V2,Ry    ;load vector Y
  L.D      F0,#0    ;load FP zero into F0
  SNEVS.D  V1,F0    ;sets VM(i) to 1 if V1(i)!=F0
  SUBVV.D  V1,V1,V2 ;subtract under vector mask
  SV       Rx,V1    ;store the result in X
• GFLOPS rate decreases!
CA-Lec8 cwliutwinseenctuedutw 34
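The masked sequence above can be modeled in scalar C as follows (a sketch of the semantics only; the vector machine executes all 64 elements at once and merely suppresses writeback for masked-off lanes):

void masked_sub(double *X, double *Y, int n) {   /* assumes n <= 64 */
    char mask[64];                    /* models the vector mask register */
    for (int i = 0; i < n; i++)
        mask[i] = (X[i] != 0.0);      /* SNEVS.D: compare X against F0 = 0 */
    for (int i = 0; i < n; i++)
        if (mask[i])                  /* SUBVV.D executes under the mask */
            X[i] = X[i] - Y[i];       /* only enabled slots are written back */
}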
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the start of the destination register; expand scatters them back to positions 1, 4, 5, 7, leaving the B values in the masked-off slots]
Used for density-time conditionals and also for general selection operations (see the C sketch below)
CA-Lec8 cwliutwinseenctuedutw 35
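A small C model of the two operations (an illustrative sketch, not hardware): compress() returns the packed length, which equals the population count of the mask, and expand() performs the inverse scatter:

int compress(const double *a, const char *mask, double *dst, int n) {
    int len = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[len++] = a[i];   /* pack enabled elements at the front */
    return len;                           /* population count of the mask */
}

void expand(const double *src, const char *mask, double *dst, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];   /* masked-off slots keep their old values */
}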
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks
    1. Support multiple loads or stores per clock. Be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 * 6 = 192
  2. SRAM takes 15 / 2.167 = 6.92, i.e. 7 processor cycles
  3. It requires 192 * 7 = 1344 memory banks at full memory bandwidth!
The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
CA-Lec8 cwliutwinseenctuedutw 37
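The arithmetic above can be checked with a few lines of C (values taken from the slide):

#include <stdio.h>
#include <math.h>

int main(void) {
    int    procs        = 32;
    int    refs_per_cyc = 4 + 2;                   /* 4 loads + 2 stores per processor */
    double cpu_cycle_ns = 2.167, sram_ns = 15.0;
    int refs = procs * refs_per_cyc;               /* 192 references per cycle */
    int busy = (int)ceil(sram_ns / cpu_cycle_ns);  /* 15/2.167 = 6.92, round up to 7 */
    printf("banks needed = %d\n", refs * busy);    /* 192 * 7 = 1344 */
    return 0;
}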
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for vector operation is latency of single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for: 2, 4
[Figure: vector processor connected to eight unpipelined DRAM banks, where bank i holds the words with Addr mod 8 = i]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
  for (i = 0; i < 100; i=i+1)
      for (j = 0; j < 100; j=j+1) {
          A[i][j] = 0.0;
          for (k = 0; k < 100; k=k+1)
              A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into the register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word (see the address-arithmetic sketch below)
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit stride data
  – Blocking techniques help non-unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
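For the matrix example above, the two strides can be made concrete with plain C address arithmetic (a sketch; B and D are the 100x100 arrays from the slide):

#include <stdio.h>
#define N 100
double B[N][N], D[N][N];

int main(void) {
    /* distance between consecutive inner-loop (k-loop) accesses */
    printf("B stride: %ld bytes\n",
           (long)((char *)&B[0][1] - (char *)&B[0][0]));  /* 8 bytes   = 1 double    */
    printf("D stride: %ld bytes\n",
           (long)((char *)&D[1][0] - (char *)&D[0][0]));  /* 800 bytes = 100 doubles */
    return 0;
}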
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes x word) / clock
• # memory banks > memory latency to avoid stalls
• If desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are good to support more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange or declaring array not a power of 2 ("array padding")
• HW: prime number of banks (see the C sketch below)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
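The bank mapping above can be sketched in C (word addresses; illustrative only). With a power-of-two bank count both operations reduce to bit tricks, which is what makes a prime bank count look expensive until the mod-within-bank trick is applied:

unsigned bank_number(unsigned addr, unsigned nbanks) {
    return addr % nbanks;               /* low bits when nbanks is a power of 2 */
}
unsigned addr_within_bank(unsigned addr, unsigned words_per_bank) {
    return addr % words_per_bank;       /* addr & (words_per_bank-1) when 2^N */
}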
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time (see the sketch below)
CA-Lec8 cwliutwinseenctuedutw 44
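One common reading of the condition above, as a C sketch (an assumption on my part: consecutive accesses to the same bank are #banks / GCD(stride, #banks) elements apart, and a stall occurs when that spacing is smaller than the bank busy time):

unsigned gcd(unsigned a, unsigned b) { return b ? gcd(b, a % b) : a; }

int bank_conflict(unsigned stride, unsigned nbanks, unsigned busy) {
    unsigned spacing = nbanks / gcd(stride, nbanks);  /* = LCM(stride,nbanks)/stride */
    return spacing < busy;  /* stride 32, 8 banks, busy 6: 1 < 6, so it stalls */
}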
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for initialization
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12+64=76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – the total time will take 12+1+6*63=391 cycles, i.e. 6.1 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors to designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk and Rm contain the starting addresses of the vectors
• Use index vectors:
  LV       Vk, Rk       ;load K
  LVI      Va, (Ra+Vk)  ;load A[K[]]
  LV       Vm, Rm       ;load M
  LVI      Vc, (Rc+Vm)  ;load C[M[]]
  ADDVV.D  Va, Va, Vc   ;add them
  SVI      (Ra+Vk), Va  ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
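A scalar C model of the gather/scatter sequence (illustrative; assumes n <= 64 so one "vector register" suffices): gather A[K[]] and C[M[]] into dense buffers as LVI does, add them densely as ADDVV.D does, then scatter the result back through K as SVI does:

void sparse_sum(double *A, const double *C,
                const int *K, const int *M, int n) {
    double Va[64], Vc[64];                          /* dense "vector registers" */
    for (int i = 0; i < n; i++) Va[i] = A[K[i]];    /* LVI  Va,(Ra+Vk)  */
    for (int i = 0; i < n; i++) Vc[i] = C[M[i]];    /* LVI  Vc,(Rc+Vm)  */
    for (int i = 0; i < n; i++) Va[i] += Vc[i];     /* ADDVV.D Va,Va,Vc */
    for (int i = 0; i < n; i++) A[K[i]] = Va[i];    /* SVI  (Ra+Vk),Va  */
}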
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency
    • One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to compilers
• Cray Y-MP benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow (a C model of one such op follows below)
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
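As an example of the per-lane semantics, here is a C model of one MMX-style operation, the 8-way unsigned byte add with saturation (PADDUSB). Real MMX does this in a single instruction on a 64-bit register; the loop only spells out what each lane computes:

#include <stdint.h>

uint64_t paddusb(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned s = ((a >> (8*lane)) & 0xFF) + ((b >> (8*lane)) & 0xFF);
        if (s > 0xFF) s = 0xFF;                 /* saturate instead of wrapping */
        r |= (uint64_t)s << (8*lane);
    }
    return r;
}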
SIMD Implementations IA32AMD64
• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Uses much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example DAXPY:
      L.D     F0,a        ;load scalar a
      MOV     F1,F0       ;copy a into F1 for SIMD MUL
      MOV     F2,F0       ;copy a into F2 for SIMD MUL
      MOV     F3,F0       ;copy a into F3 for SIMD MUL
      DADDIU  R4,Rx,#512  ;last address to load
Loop: L.4D    F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4,F4,F0    ;a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
      L.4D    F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8,F8,F4    ;a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
      S.4D    0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx,Rx,#32   ;increment index to X
      DADDIU  Ry,Ry,#32   ;increment index to Y
      DSUBU   R20,R4,Rx   ;compute bound
      BNEZ    R20,Loop    ;check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
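The model reduces to one min(); a sketch of the usual formula (the parameter names are mine):

double attainable_gflops(double peak_mem_bw_gbs,   /* GB/sec     */
                         double arith_intensity,   /* FLOPs/byte */
                         double peak_fp_gflops) {  /* GFLOPs/sec */
    double mem_bound = peak_mem_bw_gbs * arith_intensity;  /* bandwidth-limited roof */
    return mem_bound < peak_fp_gflops ? mem_bound : peak_fp_gflops;
}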
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for GPU
  – Unify all forms of GPU parallelism as CUDA thread
  – Programming model is "Single Instruction Multiple Thread"
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks (the arithmetic is sketched below)
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
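The grid arithmetic from this example, written out as plain C (a sketch with illustrative names):

#include <stdio.h>

int main(void) {
    int n = 8192, threads_per_block = 512, simd_width = 32;
    int grid_size = (n + threads_per_block - 1) / threads_per_block;  /* 16 blocks */
    int simd_insts_per_block = threads_per_block / simd_width;        /* 16 SIMD instructions */
    printf("grid = %d blocks, %d SIMD instructions per block\n",
           grid_size, simd_insts_per_block);
    return 0;
}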
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory 1280MB; L2 cache 640KB; 15 SMs (SM 0 ... SM 14), each with 32768 registers, 32 cores, shared memory 48KB, L1 cache 16KB, texture cache 8KB, and constant cache 8KB; up to 1536 threads/SM]
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively: memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
• Distinguishing execution place of functions:
  – __device__ or __global__ => GPU Device. Variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i=0; i<n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
  shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
  add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
  ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
  ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
  mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
  add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
  st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • I.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]   ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0   ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push             ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]   ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2 ; Difference in RD0
        st.global.f64  [X+R8], RD0   ; X[i] = RD0
   @P1, bra ENDIF1, *Comp            ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]   ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0   ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU memory
  – Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop‐Level Parallelism
• Example 2:
  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 use values computed by S1 in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependencies: Example
• Assume:
  – Load an array element with index c*i + d and store to a*i + b
  – i runs from m to n
• Dependence exists if the following two conditions hold:
  1. Given j, k such that m <= j <= n, m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide |d-b|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution (a C sketch of the test follows below):
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c)=2, |b-d|=3
  3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
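The test is mechanical; here is a C sketch applied to the example above (helper names are mine):

#include <stdio.h>

int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

/* store to a*i + b, load from c*i + d:
   a dependence is possible only if GCD(c,a) divides |d-b| */
int gcd_test(int a, int b, int c, int d) {
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % gcd(c, a) == 0;
}

int main(void) {
    /* X[2*i+3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
    printf("%s\n", gcd_test(2, 3, 2, 0) ? "dependence possible"
                                        : "no dependence");  /* no dependence */
    return 0;
}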
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding Dependencies
• Example 2:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
  }
• True dependences: S1->S3 and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  (this is called scalar expansion: scalar ==> vector; parallel)
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
  (this is called a reduction: it sums up the elements of the vector; not parallel)
CA-Lec8 cwliutwinseenctuedutw 94
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Instruction Execution
CA-Lec8 cwliutwinseenctuedutw 18
ADDV CAB
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
Execution using one pipelined functional unit
C[4]
C[8]
C[0]
A[12] B[12]A[16] B[16]A[20] B[20]A[24] B[24]
C[5]
C[9]
C[1]
A[13] B[13]A[17] B[17]A[21] B[21]A[25] B[25]
C[6]
C[10]
C[2]
A[14] B[14]A[18] B[18]A[22] B[22]A[26] B[26]
C[7]
C[11]
C[3]
A[15] B[15]A[19] B[19]A[23] B[23]A[27] B[27]
Four-laneexecution using four pipelined functional units
Vector Unit Structure
Lane
Functional Unit
VectorRegisters
Memory Subsystem
Elements 0 4 8 hellip
Elements 1 5 9 hellip
Elements 2 6 10 hellip
Elements 3 7 11 hellip
CA-Lec8 cwliutwinseenctuedutw 19
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: number of threads per block
CUDA Program Example
• Invoke DAXPY:    daxpy(n, 2.0, x, y);
• DAXPY in C:
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
• Invoke DAXPY with 256 threads per Thread Block:
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
• DAXPY in CUDA (a kernel launched with <<<...>>> must be __global__):
    __global__
    void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a*x[i] + y[i];
    }
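For completeness, a sketch of the host-side device-memory setup the slide omits, using the standard CUDA runtime calls (error handling elided):

    double *x_d, *y_d;                         // device copies of x and y
    cudaMalloc((void**)&x_d, n * sizeof(double));
    cudaMalloc((void**)&y_d, n * sizeof(double));
    cudaMemcpy(x_d, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(y_d, y, n * sizeof(double), cudaMemcpyHostToDevice);
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x_d, y_d);
    cudaMemcpy(y, y_d, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(x_d);  cudaFree(y_d);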
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
Example
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
    setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
    ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
    sub.f64        RD0, RD0, RD2  ; Difference in RD0
    st.global.f64  [X+R8], RD0    ; X[i] = RD0
    @P1, bra ENDIF1, *Comp        ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:
    ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
    st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1:
    <next instruction>, *Pop      ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
    }
• S1 and S2 each use a value computed by the same statement in the previous iteration (A[i] and B[i], respectively)
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2.
Example 3 for Loop-Level Parallelism (2)
• Transform to:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];        /* S2 */
        A[i+1] = A[i+1] + B[i+1];    /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried.
More on Loop-Level Parallelism
    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize (see the sketch below)
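A sketch of the resulting optimization; the temporary t is an illustrative name for the register the compiler would keep A[i] in:

    for (int i = 0; i < 100; i++) {
        double t = B[i] + C[i];   /* the value stored to A[i], computed once */
        A[i] = t;
        D[i] = t * E[i];          /* reuses the register; no second load of A[i] */
    }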
Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to the element with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b) (a C sketch of the test follows)
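A small C sketch of the GCD test as stated above; the helper names are illustrative, and it assumes a and c are not both zero:

    #include <stdlib.h>

    int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    /* returns 1 if a loop-carried dependence is possible, 0 if it is ruled out */
    int gcd_test(int a, int b, int c, int d) {
        return abs(d - b) % gcd(abs(a), abs(c)) == 0;
    }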
Example
    for (i=0; i<100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d - b| = 3
  3. Since 2 does not divide 3, no dependence is possible
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies: Example 2
• Example 2:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
Eliminating Recurrence Dependence
• Recurrence example: a dot product
    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];         /* scalar expansion: scalar ===> vector; parallel */
    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i]; /* a reduction: sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9 (the final combining step is sketched below)
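The ten per-processor partial sums still have to be combined at the end; a sketch of that final step (finalsum_total is an illustrative name):

    double finalsum_total = 0.0;
    for (int p = 0; p < 10; p++)
        finalsum_total += finalsum[p];   /* small serial tail; a log-depth tree also works */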
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Vector Unit Structure
[Figure: a four-lane vector unit. The functional-unit pipelines and vector-register file are sliced across lanes: lane 0 holds elements 0, 4, 8, …; lane 1 elements 1, 5, 9, …; lane 2 elements 2, 6, 10, …; lane 3 elements 3, 7, 11, …; each lane connects to the memory subsystem]
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
  – Allows for multiple hardware lanes
  – No communication between lanes
  – Little increase in control overhead
  – No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance
Code Vectorization
    for (i=0; i < N; i++)
        C[i] = A[i] + B[i];
[Figure: scalar sequential code executes load, load, add, store once per iteration (Iter. 1, Iter. 2, …); vectorized code issues one vector load, load, add, store covering all iterations at once]
• Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis
Vector Execution Time
• Execution time depends on three factors:
  – Length of operand vectors
  – Structural hazards
  – Data dependencies
• VMIPS functional units consume one element per clock cycle
  – Execution time is approximately the vector length
• Convoy
  – Set of vector instructions that could potentially execute together
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
  – Example machine has 32 elements per vector register and 8 lanes
[Figure: load unit, multiply unit, and add unit each working on overlapping vector instructions over time]
• Complete 24 operations/cycle while issuing 1 short instruction/cycle
Convoy
• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
Vector Chaining
• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available
    LV   v1
    MULV v3, v1, v2
    ADDV v5, v3, v4
[Figure: the load unit fills V1, which chains into the multiplier producing V3, which in turn chains into the adder producing V5]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: timing of Load/Mul followed by Add, with and without chaining]
Chimes
• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For a vector length of n, requires m × n clock cycles
Example
    LV      V1, Rx      ; load vector X
    MULVS.D V2, V1, F0  ; vector-scalar multiply
    LV      V3, Ry      ; load vector Y
    ADDVV.D V4, V2, V3  ; add two vectors
    SV      Ry, V4      ; store the sum
Convoys:
  1. LV, MULVS.D
  2. LV, ADDVV.D
  3. SV
3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
Vector Strip-mining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers: "strip-mining".
    for (i=0; i<N; i++)
        C[i] = A[i] + B[i];
[Figure: arrays A, B, C processed in 64-element chunks plus a remainder chunk]
          ANDI   R1, N, 63     ; N mod 64
          MTC1   VLR, R1       ; Do remainder
    loop: LV     V1, RA
          DSLL   R2, R1, 3     ; Multiply by 8
          DADDU  RA, RA, R2    ; Bump pointer
          LV     V2, RB
          DADDU  RB, RB, R2
          ADDV.D V3, V1, V2
          SV     V3, RC
          DADDU  RC, RC, R2
          DSUBU  N, N, R1      ; Subtract elements
          LI     R1, 64
          MTC1   VLR, R1       ; Reset full length
          BGTZ   N, loop       ; Any more to do?
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length: strip mining.
    low = 0;
    VL = (n % MVL);                         /* find odd-size piece using modulo op */
    for (j = 0; j <= (n/MVL); j=j+1) {      /* outer loop */
        for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
            Y[i] = a * X[i] + Y[i];         /* main operation */
        low = low + VL;                     /* start of next vector */
        VL = MVL;                           /* reset the length to maximum vector length */
    }
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
Vector-Mask Control
• Simple implementation: execute all N operations, turn off result writeback according to the mask
• Density-time implementation: scan the mask vector and only execute elements with non-zero masks
[Figure: for mask M = (0,1,0,0,1,1,0,1), the simple implementation computes A[i]+B[i] in every position but write-disables C[0], C[2], C[3], C[6]; the density-time implementation executes only positions 1, 4, 5, 7]
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking (a scalar model follows below)
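A scalar C model of what masked execution computes under the simple implementation (illustrative, not an ISA listing):

    for (int i = 0; i < N; i++)
        if (M[i])                  /* the mask bit gates the writeback */
            C[i] = A[i] + B[i];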
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in a vector processor without masking
• Consider:
    for (i = 0; i < 64; i=i+1)
        if (X[i] != 0)
            X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
    LV      V1, Rx      ; load vector X into V1
    LV      V2, Ry      ; load vector Y
    L.D     F0, #0      ; load FP zero into F0
    SNEVS.D V1, F0      ; sets VM(i) to 1 if V1(i) != F0
    SUBVV.D V1, V1, V2  ; subtract under vector mask
    SV      Rx, V1      ; store the result in X
• GFLOPS rate decreases
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M = (0,1,0,0,1,1,0,1), compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to positions 1, 4, 5, 7, leaving B values elsewhere]
• Used for density-time conditionals and also for general selection operations (a scalar sketch follows)
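A scalar C sketch of the compress semantics (names are illustrative):

    int k = 0;                     /* packed length so far */
    for (int i = 0; i < 8; i++)
        if (M[i])
            dst[k++] = A[i];       /* k ends equal to the population count of M */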
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non-sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. An SRAM access takes 15/2.167 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks for full memory bandwidth
• The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
Interleaved Memory Layout
[Figure: a vector processor fed by 8 unpipelined DRAM banks, selected by Addr. Mod 8 = 0 … 7]
• Great for unit stride:
  – Contiguous elements are in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4
Handling Multidimensional Arrays in Vector Architectures
• Consider:
    for (i = 0; i < 100; i=i+1)
        for (j = 0; j < 100; j=j+1) {
            A[i][j] = 0.0;
            for (k = 0; k < 100; k=k+1)
                A[i][j] = A[i][j] + B[i][k] * D[k][j];
        }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit stride data
  – Blocking techniques help non-unit stride data
How to Get Full Bandwidth for Unit Stride
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – One can read papers on how to implement a prime number of banks efficiently
Memory Bank Conflicts
    int x[256][512];
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding"; sketched below)
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if 2^N words per bank
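A sketch of the array-padding fix named above; 513 is one illustrative pad, and any column count that is not a multiple of the number of banks works:

    int x[256][513];                 /* was [256][512]: successive elements of a
                                        column no longer map to the same bank */
    for (int j = 0; j < 512; j++)    /* still use only 512 columns */
        for (int i = 0; i < 256; i++)
            x[i][j] = 2 * x[i][j];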
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – # banks / GCD(stride, # banks) < bank busy time
    (equivalently, LCM(stride, # banks) / stride < bank busy time; a quick C check follows)
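A quick C check of this condition (helper names illustrative); it reproduces the stride-1 vs. stride-32 outcome worked on the next slide:

    int gcd2(int x, int y) { return y == 0 ? x : gcd2(y, x % y); }

    /* returns 1 if this stride causes a bank conflict on this configuration */
    int bank_conflict(int banks, int stride, int busy_time) {
        return banks / gcd2(stride, banks) < busy_time;  /* distinct banks visited */
    }

    /* bank_conflict(8, 1, 6)  -> 8/1 = 8 >= 6 : no conflict */
    /* bank_conflict(8, 32, 6) -> 8/8 = 1 <  6 : conflict    */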
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
     – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
Scatter-Gather
• Consider:
    for (i = 0; i < n; i=i+1)
        A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
    LV      Vk, Rk       ; load K
    LVI     Va, (Ra+Vk)  ; load A[K[]]
    LV      Vm, Rm       ; load M
    LVI     Vc, (Rc+Vm)  ; load C[M[]]
    ADDVV.D Va, Va, Vc   ; add them
    SVI     (Ra+Vk), Va  ; store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order machines
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [Figure: benchmark table]
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., …
  – use in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses; a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY
          L.D    F0, a         ; load scalar a
          MOV    F1, F0        ; copy a into F1 for SIMD MUL
          MOV    F2, F0        ; copy a into F2 for SIMD MUL
          MOV    F3, F0        ; copy a into F3 for SIMD MUL
          DADDIU R4, Rx, #512  ; last address to load
    Loop: L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU Rx, Rx, #32   ; increment index to X
          DADDIU Ry, Ry, #32   ; increment index to Y
          DSUBU  R20, R4, Rx   ; compute bound
          BNEZ   R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
[Figure: roofline plots for example machines]
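A worked instance under assumed numbers (not from the slide): with a peak memory bandwidth of 16 GB/s and peak floating-point performance of 16 GFLOP/s, a kernel with arithmetic intensity 0.5 FLOP/byte attains min(16 × 0.5, 16) = 8 GFLOP/s (memory-bound), while one with intensity 4 attains min(16 × 4, 16) = 16 GFLOP/s (compute-bound).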
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction, Multiple Thread"
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Multiple Lanes Architecturebull Beyond one element per clock cycle
bull Elements n of vector register A is hardwired to element n of vector Bndash Allows for multiple hardware lanes
ndash No communication between lanes
ndash Little increase in control overhead
ndash No need to change machine code
CA-Lec8 cwliutwinseenctuedutw 20
Adding more lanes allows designers to tradeoff clock rate and energy without sacrificing performance
Code Vectorization
CA-Lec8 cwliutwinseenctuedutw 21
for (i=0 i lt N i++)C[i] = A[i] + B[i]
load
load
add
store
load
load
add
store
Iter 1
Iter 2
Scalar Sequential Code
Vectorization is a massive compile-time reordering of operation sequencing requires extensive loop dependence analysis
Vector Instruction
load
load
add
store
load
load
add
store
Iter 1 Iter 2
Vectorized Code
Tim
e
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:

for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors.
• Use index vectors:

LV       Vk, Rk       ; load K
LVI      Va, (Ra+Vk)  ; load A[K[]]
LV       Vm, Rm       ; load M
LVI      Vc, (Rc+Vm)  ; load C[M[]]
ADDVV.D  Va, Va, Vc   ; add them
SVI      (Ra+Vk), Va  ; store A[K[]]

• A and C must have the same number of non-zero elements (sizes of K and M).
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, it offers simpler hardware, better energy efficiency, and a better real-time model than out-of-order execution.
• More lanes, slower clock rate:
  – Scalable if elements are independent.
  – If there is a dependency:
    • One stall per vector instruction rather than one stall per vector element.
• The programmer is in charge of giving hints to the compiler.
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations.
• The fundamental design issue is memory bandwidth:
  – Especially with virtual address translation and caching.
Programming Vector Architectures
• Compilers can provide feedback to programmers.
• Programmers can provide hints to the compiler.
• Cray Y-MP benchmarks. [Table: Cray Y-MP benchmark vectorization results, lost in extraction.]
Multimedia Extensions
• Very short vectors added to existing ISAs.
• Usually 64-bit registers split into 2×32b, 4×16b, or 8×8b.
• Newer designs have 128-bit registers (AltiVec, SSE2).
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors.
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386):
  – similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC.
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits:
  – reuses the 8 FP registers (FP and MMX cannot mix).
• Short vector: load, add, store 8 8-bit operands.
• Claimed overall speedup: 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.:
  – used in drivers or added to library routines; no compiler support.
MMX Instructions
• Move: 32b, 64b.
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b:
  – optionally signed/unsigned saturate (clamp to max) on overflow (see the sketch below).
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b.
• Multiply, Multiply-Add in parallel: 4 16b.
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b:
  – sets a field to 0s (false) or 1s (true); removes branches.
• Pack/Unpack:
  – Convert 32b <-> 16b, 16b <-> 8b.
  – Pack saturates (clamps to max) if the number is too large.
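A plain-C model of one such operation, the saturating unsigned byte add (a sketch of the semantics only, not actual MMX intrinsics):

  #include <stdint.h>

  /* Adds 8 packed bytes lane by lane, clamping to 255 on overflow,
     as MMX PADDUSB does within one 64-bit register. */
  void paddusb(uint8_t dst[8], const uint8_t a[8], const uint8_t b[8]) {
      for (int i = 0; i < 8; i++) {
          unsigned s = a[i] + b[i];
          dst[i] = s > 255 ? 255 : (uint8_t)s;
      }
  }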
SIMD Implementations: IA32/AMD64
• Intel MMX (1996):
  – Repurposed the 64-bit floating-point registers.
  – Eight 8-bit integer ops or four 16-bit integer ops.
• Streaming SIMD Extensions (SSE) (1999):
  – Separate 128-bit registers.
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops.
  – Single-precision floating-point arithmetic.
• SSE2 (2001), SSE3 (2004), SSE4 (2007):
  – Double-precision floating-point arithmetic.
• Advanced Vector Extensions (2010):
  – 256-bit registers.
  – Four 64-bit integer/fp ops.
  – Extensible to 512 and 1024 bits for future generations.
SIMD Implementations: IBM
• VMX (1996-1998):
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops.
  – Data rearrangement.
• Cell SPE (PS3):
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops.
  – Unified vector/scalar execution with 128 registers.
• VMX 128 (Xbox 360):
  – Extension to 128 registers.
• VSX (2009):
  – 1 or 2 64b FPU, 4 32b FPU.
  – Integrates FPU and VMX into a unit with 64 registers.
• QPX (2010, Blue Gene):
  – Four 64b SP or DP FP.
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size.
• They cost little to add to the standard arithmetic unit.
• Easy to implement.
• Need smaller memory bandwidth than a vector architecture.
• Data transfers are separate and aligned in memory:
  – Vector: a single instruction can make 64 memory accesses, and a page fault in the middle of the vector is likely.
• Use much smaller register space.
• Fewer operands.
• No need for the sophisticated mechanisms of vector architectures.
Example SIMD Code
• Example DAXPY:

      L.D     F0, a          ; load scalar a
      MOV     F1, F0         ; copy a into F1 for SIMD MUL
      MOV     F2, F0         ; copy a into F2 for SIMD MUL
      MOV     F3, F0         ; copy a into F3 for SIMD MUL
      DADDIU  R4, Rx, #512   ; last address to load
Loop: L.4D    F4, 0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4, F4, F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D    F8, 0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8, F8, F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D    0[Ry], F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx, Rx, #32    ; increment index to X
      DADDIU  Ry, Ry, #32    ; increment index to Y
      DSUBU   R20, R4, Rx    ; compute bound
      BNEZ    R20, Loop      ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture:
  – Only contiguous data access:
    • No efficient scatter/gather accesses.
  – Significant penalty for unaligned memory access.
  – May need to write an entire vector register.
• Limitations on data access patterns:
  – Limited by cache line, up to 128-256b.
• Conditional execution:
  – Register renaming does not work well with masked execution.
  – Always need to write the whole register.
  – Difficult to know when to indicate exceptions.
• Register pressure:
  – Need to use multiple registers, rather than register depth, to hide latency.
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity.
  – Ties together floating-point performance and memory performance for a target machine.
• Arithmetic intensity:
  – Floating-point operations per byte read.
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance). A sketch follows below.
[Figure: roofline plots for sample machines, lost in extraction.]
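A short C sketch of the roofline bound (the peak numbers below are made-up assumptions, not measurements of any real machine):

  #include <stdio.h>

  /* Attainable GFLOP/s = min(peak compute, peak memory BW x arithmetic intensity). */
  static double roofline(double peak_gflops, double peak_bw_gbs, double ai) {
      double mem_bound = peak_bw_gbs * ai;
      return mem_bound < peak_gflops ? mem_bound : peak_gflops;
  }

  int main(void) {
      /* Hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory bandwidth. */
      for (double ai = 0.25; ai <= 16.0; ai *= 2)
          printf("AI %5.2f FLOP/byte -> %6.2f GFLOP/s\n",
                 ai, roofline(100.0, 25.0, ai));
      return 0;
  }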
History of GPUs
• Early video cards:
  – Frame buffer memory with address generation for video output.
• 3D graphics processing:
  – Originally on high-end computers (e.g., SGI).
  – 3D graphics cards for PCs and game consoles.
• Graphics Processing Units:
  – Processors oriented to 3D graphics tasks.
  – Vertex/pixel processing, shading, texture mapping, ray tracing.
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model:
    • CPU is the host, GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – Programming model is "Single Instruction, Multiple Thread".
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++:
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language.
  – minimalist set of abstractions for expressing parallelism:
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems.
  – Scatter-gather transfers from memory into local store.
  – Mask registers.
  – Large register files.
• Differences:
  – No scalar processor; relies on scalar integration.
  – Uses multithreading to hide memory latency.
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor.
Programming the GPU
• CUDA Programming Model:
  – Single Instruction, Multiple Thread (SIMT):
    • A thread is associated with each data element.
    • Threads are organized into blocks.
    • Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS:
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192 (a CUDA sketch follows below):
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes:
    • 512 threads per block.
  – A SIMD instruction executes 32 elements at a time.
  – Thus grid size = 16 blocks.
  – A block is analogous to a strip-mined vector loop with vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.
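A minimal CUDA sketch of this decomposition (the kernel name vecMul is illustrative; the launch parameters mirror the numbers above):

  // One thread per element: 8192 elements / 512 threads per block = 16 blocks.
  __global__ void vecMul(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // global element index
      if (i < n)
          c[i] = a[i] * b[i];
  }

  // Host-side launch (error handling omitted):
  //   vecMul<<<16, 512>>>(dev_a, dev_b, dev_c, 8192);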
Terminology
• Threads of SIMD instructions:
  – Each has its own PC.
  – The thread scheduler uses a scoreboard to dispatch.
  – No data dependencies between threads.
  – Keeps track of up to 48 threads of SIMD instructions:
    • Hides memory latency.
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes.
  – Wide and shallow compared to vector processors.
Example
• An NVIDIA GPU has 32,768 registers:
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements.
    • 32 vector registers of 32 64-bit elements.
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers.
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory: 1280MB; L2 cache: 640KB. 15 SMs (SM 0 ... SM 14), each with 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache; up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively:
  – Memory access optimization, latency hiding.
Matrix Multiplication
[Figure: CUDA matrix-multiplication illustration, lost in extraction.]
Programming the GPU
[Figure: CUDA source listing, lost in extraction.]
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a tiled-kernel sketch follows below).
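A hedged CUDA sketch of such a tiled kernel (the 16×16 tile size, the name matMulTiled, and the assumption that n is a multiple of the tile width are illustrative choices, not from the slides):

  #define TILE 16   // 16x16 = 256 threads per block

  // Each thread computes one element of C; tiles of A and B are staged
  // through shared memory so each global-memory word is reused TILE times.
  __global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];
      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float sum = 0.0f;
      for (int t = 0; t < n / TILE; t++) {       // n assumed a multiple of TILE
          As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
          Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
          __syncthreads();                        // wait until the tile is loaded
          for (int k = 0; k < TILE; k++)
              sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();                        // done reading this tile
      }
      C[row * n + col] = sum;
  }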
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device. Variables declared there are allocated to GPU memory.
  – __host__ => system processor (host).
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list):
  – blockIdx: block identifier.
  – threadIdx: thread identifier within a block.
  – blockDim: threads per block.
CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)".
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example:

shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – A branch synchronization stack:
    • Entries consist of masks for each SIMD lane.
    • I.e., which threads commit their results (all threads execute).
  – Instruction markers to manage when a branch diverges into multiple execution paths:
    • Push on a divergent branch.
  – ...and when paths converge:
    • Acts as a barrier.
    • Pops the stack.
• Per-thread-lane 1-bit predicate register, specified by the programmer.
Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push       ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2  ; Difference in RD0
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
        @P1, bra ENDIF1, *Comp       ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM:
  – "Private memory".
  – Contains the stack frame, spilled registers, and private variables.
• Each multithreaded SIMD processor also has local memory:
  – Shared by the SIMD lanes / threads within a block.
• Memory shared by all SIMD processors is GPU Memory:
  – The host can read and write GPU memory.
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units.
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units.
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles.
• Fast double precision.
• Caches for GPU memory.
• 64-bit addressing and unified address space.
• Error correcting codes.
• Faster context switching.
• Faster atomic instructions.
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor, lost in extraction.]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence:
  – The analysis focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations.
• Loop-level parallelism requires no loop-carried dependence.
• Example 1:

for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;

• No loop-carried dependence.
Example 2 for Loop-Level Parallelism
• Example 2:

for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}

• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration:
  – Loop-carried dependence.
• S2 uses the value computed by S1 in the same iteration:
  – No loop-carried dependence.
Remarks
• Intra-loop dependence is not loop-carried dependence:
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence.
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1.
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1.
• A loop is parallel if it can be written without a cycle in the dependences:
  – The absence of a cycle means that the dependences give a partial ordering on the statements.
Example 3 for Loop-Level Parallelism (1)
• Example 3:

for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];      /* S1 */
  B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.
Example 3 for Loop-Level Parallelism (2)
• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried.
More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction:
  – The two references are the same; there is no intervening memory access to the same location.
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize.
Recurrence
• A recurrence is a special form of loop-carried dependence.
• Recurrence example:

for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences.
  – It may still be possible to exploit a fair amount of ILP.
Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index).
  – The index of a multi-dimensional array is affine if the index in each dimension is affine.
  – Non-affine access example: x[ y[i] ].
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.
Finding Dependencies: Example
• Assume:
  – We store to an array element with index a × i + b and load from the same array with index c × i + d.
  – i runs from m to n.
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n.
  2. a × j + b = c × k + d.
• In general, the values of a, b, c, and d are not known at compile time:
  – Dependence testing is expensive but decidable.
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b).
Example

for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution (a C sketch of the test follows below):
  1. a=2, b=3, c=2, and d=0.
  2. GCD(a, c) = 2; |d - b| = 3.
  3. Since 2 does not divide 3, no dependence is possible.
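A small C sketch of the GCD test (gcd_test_may_depend is a hypothetical helper; a real compiler would also consult the loop bounds):

  #include <stdio.h>
  #include <stdlib.h>

  static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

  /* Store index a*i + b, load index c*i + d. Returns 0 only when the
     GCD test proves independence; 1 means a dependence is possible. */
  static int gcd_test_may_depend(int a, int b, int c, int d) {
      return abs(d - b) % gcd(abs(c), abs(a)) == 0;
  }

  int main(void) {
      /* X[2i+3] = X[2i] * 5.0  =>  a=2, b=3, c=2, d=0 */
      puts(gcd_test_may_depend(2, 3, 2, 0) ? "dependence possible"
                                           : "no dependence");  /* no dependence */
      return 0;
  }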
Remarks
• The GCD test is sufficient but not necessary:
  – The GCD test does not consider the loop bounds.
  – There are cases where the GCD test succeeds but no dependence exists.
• Determining whether a dependence actually exists is NP-complete.
Finding Dependencies
• Example 2:

for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;  /* S1 */
  X[i] = X[i] + c;  /* S2 */
  Z[i] = Y[i] + c;  /* S3 */
  Y[i] = c - Y[i];  /* S4 */
}

• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried.
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i]).
• Output dependence: S1->S4 (Y[i]).
Renaming to Eliminate False (Pseudo) Dependences
• Before:

for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}

• After:

for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product:

for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence.
• Transform to:

for (i=9999; i>=0; i=i-1)     /* scalar expansion: scalar ==> vector; parallel */
  sum[i] = x[i] * y[i];

for (i=9999; i>=0; i=i-1)     /* a reduction: sums the vector's elements; not parallel */
  finalsum = finalsum + sum[i];
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures:
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum 1000 elements on each of ten processors:

for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9.
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse-grained vs. fine-grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture:
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines.
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.
Code Vectorization

for (i=0; i < N; i++)
  C[i] = A[i] + B[i];

[Figure: scalar sequential code issues load/load/add/store once per iteration (Iter 1, Iter 2, ...); vectorized code issues one vector load/load/add/store covering all iterations.]
• Vectorization is a massive compile-time reordering of operation sequencing; it requires extensive loop dependence analysis.
Vector Execution Time
• Execution time depends on three factors:
  – Length of the operand vectors.
  – Structural hazards.
  – Data dependencies.
• VMIPS functional units consume one element per clock cycle:
  – Execution time is approximately the vector length.
• Convoy:
  – Set of vector instructions that could potentially execute together.
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions:
  – Example machine has 32 elements per vector register and 8 lanes.
[Figure: load, multiply, and add units each processing overlapped vector instructions over time.]
• Complete 24 operations/cycle while issuing 1 short instruction/cycle.
Convoy
• Convoy: set of vector instructions that could potentially execute together:
  – Must not contain structural hazards.
  – Sequences with RAW hazards should be in different convoys.
• However, sequences with RAW hazards can be in the same convoy via chaining.
Vector Chaining
• Vector version of register bypassing:
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available.

LV    V1
MULV  V3, V1, V2
ADDV  V5, V3, V4

[Figure: results stream from the load unit through V1 into the multiplier, and from V3 into the adder, forming two chains.]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting a dependent instruction.
• With chaining, can start a dependent instruction as soon as the first result appears.
[Figure: Load/Mul/Add timelines with and without chaining.]
Chimes
• Chime: unit of time to execute one convoy:
  – m convoys execute in m chimes.
  – For a vector length of n, this requires m × n clock cycles.
Example

LV       V1, Rx      ; load vector X
MULVS.D  V2, V1, F0  ; vector-scalar multiply
LV       V3, Ry      ; load vector Y
ADDVV.D  V4, V2, V3  ; add two vectors
SV       Ry, V4      ; store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes; 2 FP ops per result; cycles per FLOP = 1.5.
For 64-element vectors, requires 64 × 3 = 192 clock cycles.
Vector Strip-Mining
• Problem: vector registers have finite length.
• Solution: break loops into pieces that fit into vector registers: "strip-mining".

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

      ANDI    R1, N, 63    ; N mod 64
      MTC1    VLR, R1      ; Do remainder
loop: LV      V1, RA
      DSLL    R2, R1, 3    ; Multiply by 8
      DADDU   RA, RA, R2   ; Bump pointer
      LV      V2, RB
      DADDU   RB, RB, R2
      ADDV.D  V3, V1, V2
      SV      V3, RC
      DADDU   RC, RC, R2
      DSUBU   N, N, R1     ; Subtract elements
      LI      R1, 64
      MTC1    VLR, R1      ; Reset full length
      BGTZ    N, loop      ; Any more to do?

[Figure: arrays A, B, C split into a short remainder piece followed by full 64-element pieces.]
Vector Length Register
• Handles loops whose length is not equal to 64.
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:

low = 0;
VL = (n % MVL);                       /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {    /* outer loop */
  for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
    Y[i] = a * X[i] + Y[i];           /* main operation */
  low = low + VL;                     /* start of next vector */
  VL = MVL;                           /* reset the length to maximum vector length */
}
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture.
• All blocks but the first are of length MVL:
  – Utilize the full power of the vector processor.
• Later generations may grow the MVL:
  – No need to change the ISA.
Vector-Mask Control
• Simple implementation:
  – Execute all N operations; turn off result writeback according to the mask.
• Density-time implementation:
  – Scan the mask vector and only execute elements with non-zero masks.
[Figure: both implementations shown for mask M[0..7] = 0,1,0,0,1,1,0,1 gating writes of C[i] = A[i] + B[i] through the write data port.]
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction.
• The VMR is part of the architectural state.
• A vector processor relies on compilers to manipulate the VMR explicitly.
• A GPU gets the same effect using hardware:
  – Invisible to SW.
• Both GPUs and vector processors spend time on masking.
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor.
• Consider:

for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:

LV       V1, Rx      ; load vector X into V1
LV       V2, Ry      ; load vector Y
L.D      F0, #0      ; load FP zero into F0
SNEVS.D  V1, F0      ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D  V1, V1, V2  ; subtract under vector mask
SV       Rx, V1      ; store the result in X

• The GFLOPS rate decreases.
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register (a scalar sketch follows below):
  – The population count of the mask vector gives the packed vector length.
• Expand performs the inverse operation.
[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress gathers A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to those positions.]
• Used for density-time conditionals and also for general selection operations.
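A scalar C sketch of the compress semantics (illustrative; vector hardware does this in one instruction):

  /* Packs src elements whose mask bit is 1 to the front of dst;
     returns the packed length (the population count of the mask). */
  static int compress(const double *src, const int *mask, double *dst, int n) {
      int len = 0;
      for (int i = 0; i < n; i++)
          if (mask[i])
              dst[len++] = src[i];
      return len;
  }
  /* With the mask 0,1,0,0,1,1,0,1 from the figure, compressing A[0..7]
     yields A[1], A[4], A[5], A[7] and length 4. */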
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. What about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector.
• Penalties for start-ups on load/store units are higher than those for arithmetic units.
• The memory system must be designed to support high bandwidth for vector loads and stores:
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock. Be able to control the addresses to the banks independently.
    2. Support (multiple) non-sequential loads or stores.
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses.
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle.
• Processor cycle time is 2.167 ns.
• SRAM cycle time is 15 ns.
• How many memory banks are needed?
• Solution (a C sketch follows below):
  1. The maximum number of memory references each cycle is 32 × 6 = 192.
  2. One SRAM cycle takes 15 / 2.167 = 6.92, i.e., 7 processor cycles.
  3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth!
• The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth).
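The same arithmetic as a small C program (a sketch; ceil rounds the 6.92-cycle SRAM busy time up to whole processor cycles):

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      int refs = 32 * (4 + 2);                   /* 192 references per cycle */
      int busy = (int)ceil(15.0 / 2.167);        /* 7 processor cycles       */
      printf("banks needed: %d\n", refs * busy); /* 1344                     */
      return 0;
  }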
Memory Addressing
• Load/store operations move groups of data between registers and memory.
• Three types of addressing:
  – Unit stride:
    • Contiguous block of information in memory.
    • Fastest; always possible to optimize this.
  – Non-unit (constant) stride:
    • Harder to optimize the memory system for all possible strides.
    • A prime number of data banks makes it easier to support different strides at full bandwidth.
  – Indexed (gather-scatter):
    • Vector equivalent of register indirect.
    • Good for sparse arrays of data.
    • Increases the number of programs that vectorize.
Interleaved Memory Layout
[Figure: a vector processor feeding eight unpipelined DRAM banks, selected by Addr Mod 8 = 0 ... 7.]
• Great for unit stride:
  – Contiguous elements live in different DRAMs.
  – Startup time for a vector operation is the latency of a single read.
• What about non-unit stride?
  – The above is good for strides that are relatively prime to 8.
  – Bad for strides of 2, 4.
Handling Multidimensional Arrays in Vector Architectures
• Consider:

for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }

• Must vectorize multiplications of rows of B with columns of D:
  – Need to access adjacent elements of B and adjacent elements of D.
  – Elements of B are accessed in row-major order but elements of D in column-major order (see the sketch below).
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements.
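A C sketch that makes the two strides concrete (illustrative; it prints byte offsets for the row-major layout C uses, matching the "stride of 100 double words" quoted for D on the stride-addressing slide):

  #include <stdio.h>

  double B[100][100], D[100][100];

  int main(void) {
      /* B walks along a row: adjacent doubles. D walks down a column:
         each step skips a whole 100-double row. */
      printf("B row step:    %ld bytes\n",
             (long)((char *)&B[0][1] - (char *)&B[0][0]));   /* 8   */
      printf("D column step: %ld bytes\n",
             (long)((char *)&D[1][0] - (char *)&D[0][0]));   /* 800 */
      return 0;
  }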
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Execution Timebull Execution time depends on three factors
ndash Length of operand vectorsndash Structural hazardsndash Data dependencies
bull VMIPS functional units consume one element per clock cyclendash Execution time is approximately the vector length
bull Convoyndash Set of vector instructions that could potentially execute together
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
22
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip-mining
Problem: vector registers have finite length
Solution: break loops into pieces that fit into vector registers, "strip-mining"
for (i=0; i<N; i++)
  C[i] = A[i]+B[i];
      ANDI   R1, N, 63   ; N mod 64
      MTC1   VLR, R1     ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3   ; Multiply by 8
      DADDU  RA, RA, R2  ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1    ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1     ; Reset full length
      BGTZ   N, loop     ; Any more to do?
[Figure: A + B = C over 64-element pieces, with a short remainder piece handled first]
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
• Handles loops whose length is not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip mining:
low = 0;
VL = (n % MVL);                      /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {   /* outer loop */
  for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
    Y[i] = a * X[i] + Y[i];          /* main operation */
  low = low + VL;                    /* start of next vector */
  VL = MVL;                          /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector-Mask Control
Simple implementation
  – execute all N operations, turn off result writeback according to mask
Density-time implementation
  – scan mask vector and only execute elements with non-zero masks
[Figure: both implementations for mask M[0..7] = 0,1,0,0,1,1,0,1 computing A[i]+B[i] into C[i]; the simple version uses a write-disable on the write data port]
CA-Lec8 cwliutwinseenctuedutw 32
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
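A rough scalar C model of masked execution (VM and VL are illustrative stand-ins for the mask register and vector length, not architectural names):

/* Scalar model of a masked vector subtract: only lanes with VM[i]=1 write back. */
void masked_sub(double *X, const double *Y, const int *VM, int VL) {
    for (int i = 0; i < VL; i++)
        if (VM[i])             /* mask bit gates the writeback */
            X[i] = X[i] - Y[i];
}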
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in a vector processor without mask support
• Consider:
for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV      V1,Rx     ;load vector X into V1
LV      V2,Ry     ;load vector Y
L.D     F0,#0     ;load FP zero into F0
SNEVS.D V1,F0     ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2  ;subtract under vector mask
SV      Rx,V1     ;store the result in X
• GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – population count of mask vector gives packed vector length
• Expand performs the inverse operation
[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the front of the destination; expand scatters them back to the masked positions, leaving the B elements elsewhere]
Used for density-time conditionals and also for general selection operations
CA-Lec8 cwliutwinseenctuedutw 35
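A small C sketch of both operations under these semantics (function names are illustrative; compress returns the packed length):

/* Compress: pack elements of src whose mask bit is 1 to the front of dst.
   Returns the packed length (population count of the mask). */
int compress(double *dst, const double *src, const int *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;
}

/* Expand: inverse operation; unpack a packed src back to the masked positions of dst. */
void expand(double *dst, const double *src, const int *mask, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
}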
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
1. Support multiple loads or stores per clock. Be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. An SRAM access takes 15/2.167 ≈ 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
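The same bank arithmetic as a small C check (variable names are illustrative):

#include <math.h>
#include <stdio.h>

int main(void) {
    int procs = 32, refs_per_proc = 6;                 /* 4 loads + 2 stores */
    double proc_ns = 2.167, sram_ns = 15.0;
    int refs_per_cycle = procs * refs_per_proc;        /* 192 references/cycle */
    int busy_cycles = (int)ceil(sram_ns / proc_ns);    /* ~7 processor cycles */
    printf("banks needed = %d\n", refs_per_cycle * busy_cycles); /* prints 1344 */
    return 0;
}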
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for strides of 2, 4
[Figure: a vector processor fed by 8 unpipelined DRAM banks, bank selected by Addr mod 8 = 0 .. 7]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit stride data
  – Blocking techniques help non-unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
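For example, loading one column of D (stride of 100 double words) can be modeled in C as below (a scalar sketch of what LVWS does; names are illustrative):

/* Scalar model of a strided vector load (LVWS): gather column j of a
   100x100 row-major matrix D into a dense vector; stride = 100 doubles. */
void load_with_stride(double *v, const double *base, int stride, int vl) {
    for (int i = 0; i < vl; i++)
        v[i] = base[(long)i * stride];
}
/* usage sketch: load_with_stride(vec, &D[0][j], 100, 100); */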
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes × word)/clock
• # memory banks > memory latency, to avoid stalls
• If desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number is easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride:
  – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
  – the total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk       ;load K
LVI     Va, (Ra+Vk)  ;load A[K[]]
LV      Vm, Rm       ;load M
LVI     Vc, (Rc+Vm)  ;load C[M[]]
ADDVV.D Va, Va, Vc   ;add them
SVI     (Ra+Vk), Va  ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
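A scalar C model of the LVI/SVI semantics used above (illustrative; vl is the number of nonzero elements):

/* Gather (LVI): Va[i] = A[K[i]] */
void gather(double *Va, const double *A, const int *K, int vl) {
    for (int i = 0; i < vl; i++) Va[i] = A[K[i]];
}
/* Scatter (SVI): A[K[i]] = Va[i] */
void scatter(double *A, const int *K, const double *Va, int vl) {
    for (int i = 0; i < vl; i++) A[K[i]] = Va[i];
}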
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – use in drivers or added to library routines; no compiler
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optional signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (sets to max) if the number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
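For a concrete taste of AVX's four 64-bit fp ops per instruction, a hedged C sketch using Intel intrinsics (assumes an AVX-capable x86 CPU, compilation with -mavx, and n a multiple of 4; not from the slides):

#include <immintrin.h>

/* DAXPY with AVX: process 4 doubles per 256-bit instruction.
   A strip-mined remainder loop (as discussed for VLR) would handle
   leftover elements when n is not a multiple of 4. */
void daxpy_avx(int n, double a, const double *x, double *y) {
    __m256d va = _mm256_set1_pd(a);               /* broadcast scalar a */
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]);      /* load X[i..i+3] */
        __m256d vy = _mm256_loadu_pd(&y[i]);      /* load Y[i..i+3] */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);  /* a*X + Y */
        _mm256_storeu_pd(&y[i], vy);              /* store Y[i..i+3] */
    }
}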
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DXPY
      L.D    F0,a        ;load scalar a
      MOV    F1, F0      ;copy a into F1 for SIMD MUL
      MOV    F2, F0      ;copy a into F2 for SIMD MUL
      MOV    F3, F0      ;copy a into F3 for SIMD MUL
      DADDIU R4,Rx,#512  ;last address to load
Loop: L.4D   F4,0[Rx]    ;load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4,F4,F0    ;a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D   F8,0[Ry]    ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8,F8,F4    ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D   0[Ry],F8    ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx,Rx,#32   ;increment index to X
      DADDIU Ry,Ry,#32   ;increment index to Y
      DSUBU  R20,R4,Rx   ;compute bound
      BNEZ   R20,Loop    ;check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
[Figure: roofline plots for example machines]
CA-Lec8 cwliutwinseenctuedutw 59
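The same bound as a tiny C helper (a sketch; units assumed: GFLOP/s for the peak, GB/s for bandwidth, FLOPs/byte for intensity):

/* Attainable GFLOP/s under the roofline model. */
double roofline(double peak_gflops, double peak_bw_gbs, double intensity) {
    double memory_bound = peak_bw_gbs * intensity;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
/* e.g., roofline(100.0, 25.0, 2.0) -> 50.0: memory-bound at intensity 2 */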
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory 1280MB; L2 cache 640KB; 15 SMs (SM 0 .. SM 14), each with 32 cores, 32768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache; up to 1536 threads/SM]
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
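A minimal CUDA sketch of this one-thread-per-cell scheme (without the shared-memory tiling mentioned above; the kernel name, 16x16 block shape, and N being a multiple of 16 are assumptions):

// Each thread computes one element C[row][col] of an N x N product.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
// Launch sketch: dim3 block(16,16); dim3 grid(N/16, N/16);
// matmul<<<grid, block>>>(dA, dB, dC, N);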
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared are allocated to the GPU memory
  – __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: number of threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw
76
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
        @!P1, bra ELSE1, *Push      ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2 ; Difference in RD0
        st.global.f64 [X+R8], RD0   ; X[i] = RD0
        @P1, bra ENDIF1, *Comp      ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0   ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
79
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor]
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
81
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
  – Loop-carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
82
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];    /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same: there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependences: Example
• Assume:
  – Load an array element with index c × i + d, and store to the element with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
CA-Lec8 cwliutwinseenctuedutw 89
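A tiny C sketch of the GCD screening test (assumes a, c > 0; function names are illustrative; returning 0 means a dependence is provably impossible):

/* GCD test: a loop-carried dependence between store a*i+b and load c*i+d
   is possible only if gcd(c,a) divides (d-b). */
int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

int dependence_possible(int a, int b, int c, int d) {
    int g = gcd(c, a);
    return ((d - b) % g) == 0;   /* 1: possible, 0: ruled out */
}
/* Example from the next slide: a=2, b=3, c=2, d=0 gives gcd=2 and (d-b)=-3;
   2 does not divide 3, so no dependence is possible. */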
Example
for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c)=2, |b-d|=3
3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
• If the GCD test fails, no dependence is possible; but a successful GCD test does not prove that a dependence exists
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding Dependences
• Example 2:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;  /* S1 */
  X[i] = X[i] + c;  /* S2 */
  Z[i] = Y[i] + c;  /* S3 */
  Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to...
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
/* This is called scalar expansion. Scalar ===> vector: parallel */
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
/* This is called a reduction. Sums up the elements of the vector: not parallel */
CA-Lec8 cwliutwinseenctuedutw 94
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
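A C sketch of the scheme above (NPROC, CHUNK, and the function name are illustrative): phase 1 forms the ten partial sums, conceptually in parallel; phase 2 combines them with a short serial reduction.

#define NPROC 10
#define CHUNK 1000

double reduce(const double sum[NPROC * CHUNK]) {
    double finalsum[NPROC] = {0.0};
    for (int p = 0; p < NPROC; p++)        /* conceptually parallel over p */
        for (int i = CHUNK - 1; i >= 0; i--)
            finalsum[p] += sum[i + CHUNK * p];
    double total = 0.0;
    for (int p = 0; p < NPROC; p++)        /* final combine of 10 partial sums */
        total += finalsum[p];
    return total;
}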
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
load
Vector Instruction ParallelismCan overlap execution of multiple vector instructions
ndash example machine has 32 elements per vector register and 8 lanes
loadmul
mul
add
add
Load Unit Multiply Unit Add Unit
time
Complete 24 operationscycle while issuing 1 short instructioncycle
CA-Lec8 cwliutwinseenctuedutw 23
Convoy
bull Convoy set of vector instructions that could potentially execute togetherndash Must not contain structural hazardsndash Sequences with RAW hazards should be in different convoys
bull However sequences with RAW hazards can be in the same convey via chaining
CA-Lec8 cwliutwinseenctuedutw 24
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

• No loop-carried dependence
Example 2 for Loop‐Level Parallelism
• Example 2:

  for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
  }

• S1 and S2 use values computed by S1 in previous iteration
  – Loop-carried dependence
• S2 uses value computed by S1 in same iteration
  – No loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop‐Level Parallelism (1)
• Example 3:

  for (i=0; i<100; i=i+1) {
      A[i] = A[i] + B[i];      /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
  }

• S1 uses value computed by S2 in previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop‐Level Parallelism (2)
• Transform to:

  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
      B[i+1] = C[i] + D[i];      /* S2 */
      A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop carried
More on Loop‐Level Parallelism
  for (i=0; i<100; i=i+1) {
      A[i] = B[i] + C[i];
      D[i] = A[i] * E[i];
  }

• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:

  for (i=1; i<100; i=i+1)
      Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a×i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding dependencies: Example
• Assume:
  – Load an array element with index c×i + d and store to a×i + b
  – i runs from m to n
• Dependence exists if the following two conditions hold:
  1. Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
  2. a×j + b = c×k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c,a) must divide (d-b)
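A compiler can apply the test directly; a minimal C sketch (the function name gcd_test is illustrative, and the coefficients a and c are assumed non-zero):

  #include <stdlib.h>

  /* greatest common divisor by Euclid's algorithm */
  static int gcd(int a, int b)
  {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* GCD test: a store to a*i + b and a load of c*i + d can touch the
   * same element only if GCD(c, a) divides (d - b). Returns 0 when the
   * test proves independence; 1 means a dependence is still possible. */
  int gcd_test(int a, int b, int c, int d)
  {
      return abs(d - b) % gcd(abs(c), abs(a)) == 0;
  }

For the example on the next slide, gcd_test(2, 3, 2, 0) returns 0: the GCD is 2, which does not divide 3.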
Example
  for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a,c) = 2, |b-d| = 3
  3. Since 2 does not divide 3, no dependence is possible
Remarks
• The GCD test is sufficient to rule out a dependence, but a successful test does not prove one exists
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding dependencies
• Example 2:

  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;   /* S1 */
      X[i] = X[i] + c;   /* S2 */
      Z[i] = Y[i] + c;   /* S3 */
      Y[i] = c - Y[i];   /* S4 */
  }

• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:

  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }

• After:

  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;
      X1[i] = X[i] + c;
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }
Eliminating Recurrence Dependence
• Recurrence example, a dot product:

  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to:

  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  /* this is called scalar expansion: scalar ===> vector; parallel */

  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
  /* this is called a reduction: sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:

  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];

  – Assume p ranges from 0 to 9
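The ten per-processor partial sums can then be combined in logarithmic time; a small C sketch (the function name is illustrative, and finalsum[] is the array from the slide):

  /* tree reduction of the 10 partial sums; the total ends up in finalsum[0] */
  void combine(double finalsum[10])
  {
      for (int step = 1; step < 10; step = step * 2)      /* step = 1, 2, 4, 8 */
          for (int p = 0; p + step < 10; p = p + 2*step)
              finalsum[p] = finalsum[p] + finalsum[p + step];
  }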
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Convoy
• Convoy: set of vector instructions that could potentially execute together
  – Must not contain structural hazards
  – Sequences with RAW hazards should be in different convoys
• However, sequences with RAW hazards can be in the same convoy via chaining
Vector Chaining
• Vector version of register bypassing
  – Allows a vector operation to start as soon as the individual elements of its vector source operand become available

  LV    v1
  MULV  v3, v1, v2
  ADDV  v5, v3, v4

[Figure: memory feeds the load unit into V1; V1 chains into the multiplier with V2 to produce V3; V3 chains into the adder with V4 to produce V5]
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
• With chaining, can start the dependent instruction as soon as the first result appears

[Figure: timing of a Load/Mul followed by a dependent Add, without and with chaining]
Chimes
• Chime: unit of time to execute one convoy
  – m convoys execute in m chimes
  – For vector length of n, requires m × n clock cycles
Example

  LV       V1,Rx      ; load vector X
  MULVS.D  V2,V1,F0   ; vector-scalar multiply
  LV       V3,Ry      ; load vector Y
  ADDVV.D  V4,V2,V3   ; add two vectors
  SV       Ry,V4      ; store the sum

Convoys:
  1. LV, MULVS.D
  2. LV, ADDVV.D
  3. SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
Vector Strip‐mining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers ("strip-mining").

  for (i=0; i<N; i++)
      C[i] = A[i]+B[i];

  ANDI   R1, N, 63     ; N mod 64
  MTC1   VLR, R1       ; Do remainder
loop:
  LV     V1, RA
  DSLL   R2, R1, 3     ; Multiply by 8
  DADDU  RA, RA, R2    ; Bump pointer
  LV     V2, RB
  DADDU  RB, RB, R2
  ADDV.D V3, V1, V2
  SV     V3, RC
  DADDU  RC, RC, R2
  DSUBU  N, N, R1      ; Subtract elements
  LI     R1, 64
  MTC1   VLR, R1       ; Reset full length
  BGTZ   N, loop       ; Any more to do?

[Figure: A + B = C processed in 64-element chunks plus a remainder]
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip mining:

  low = 0;
  VL = (n % MVL);                        /* find odd-size piece using modulo op */
  for (j = 0; j <= (n/MVL); j=j+1) {     /* outer loop */
      for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
          Y[i] = a * X[i] + Y[i];        /* main operation */
      low = low + VL;                    /* start of next vector */
      VL = MVL;                          /* reset the length to maximum vector length */
  }
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
  – Utilize the full power of the vector processor
• Later generations may grow the MVL
  – No need to change the ISA
Vector‐Mask Control
Simple Implementation
  – execute all N operations, turn off result writeback according to mask

Density-Time Implementation
  – scan mask vector and only execute elements with non-zero masks

[Figure: with mask M = 0,1,0,0,1,1,0,1 over A[i]+B[i] -> C[i], the simple scheme computes all eight sums and write-disables C[0], C[2], C[3], C[6]; the density-time scheme computes only C[1], C[4], C[5], C[7]]
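In C terms, the two implementations behave like this sketch (function and variable names are illustrative):

  /* simple: compute all N elements, gate writeback with the mask */
  void masked_add_simple(int N, const double *A, const double *B,
                         double *C, const int *M)
  {
      for (int i = 0; i < N; i++) {
          double t = A[i] + B[i];
          if (M[i]) C[i] = t;          /* writeback disabled when M[i] == 0 */
      }
  }

  /* density-time: execute only the masked-on elements */
  void masked_add_density(int N, const double *A, const double *B,
                          double *C, const int *M)
  {
      for (int i = 0; i < N; i++)
          if (M[i]) C[i] = A[i] + B[i];
  }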
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• For a vector processor, it relies on compilers to manipulate the VMR explicitly
• For a GPU, it gets the same effect using hardware
  – Invisible to SW
• Both GPU and vector processor spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in vector mode without support
• Consider:

  for (i = 0; i < 64; i=i+1)
      if (X[i] != 0)
          X[i] = X[i] - Y[i];

• Use the vector mask register to "disable" elements:

  LV       V1,Rx     ; load vector X into V1
  LV       V2,Ry     ; load vector Y
  L.D      F0,#0     ; load FP zero into F0
  SNEVS.D  V1,F0     ; sets VM(i) to 1 if V1(i) != F0
  SUBVV.D  V1,V1,V2  ; subtract under vector mask
  SV       Rx,V1     ; store the result in X

• GFLOPS rate decreases!
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
  – population count of the mask vector gives the packed vector length
• Expand performs the inverse operation

[Figure: with mask M = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the start of the destination; expand scatters them back to the masked positions, leaving the B values elsewhere]

Used for density-time conditionals and also for general selection operations.
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non-sequential loads or stores
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. SRAM takes 15 / 2.167 = 6.92 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth

Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
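The arithmetic can be checked directly; a small C sketch of the calculation (variable names are illustrative):

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double proc_cycle = 2.167;                      /* ns */
      double sram_cycle = 15.0;                       /* ns */
      int refs = 32 * (4 + 2);                        /* 192 references/cycle */
      int busy = (int)ceil(sram_cycle / proc_cycle);  /* 6.92 -> 7 cycles */
      printf("banks needed: %d\n", refs * busy);      /* 192 * 7 = 1344 */
      return 0;
  }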
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non-unit (constant) stride
    • Harder to optimize memory system for all possible strides
    • Prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases number of programs that vectorize
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non-unit stride?
  – Above good for strides that are relatively prime to 8
  – Bad for 2, 4

[Figure: vector processor connected to 8 unpipelined DRAM banks; bank i holds the addresses with Addr mod 8 = i]
Handling Multidimensional Arrays in Vector Architectures
• Consider:

  for (i = 0; i < 100; i=i+1)
      for (j = 0; j < 100; j=j+1) {
          A[i][j] = 0.0;
          for (k = 0; k < 100; k=k+1)
              A[i][j] = A[i][j] + B[i][k] * D[k][j];
      }

• Must vectorize multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into the register, it acts as if it has logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit stride data
  – Blocking techniques help non-unit stride data
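In C terms, the strided load of one column of D behaves like this sketch (the function name and fixed vector length of 64 are illustrative assumptions):

  /* LVWS-style strided load: v1[i] = Mem[base + i * stride];
   * for double D[100][100], one column step is 100 doublewords */
  void load_column(double v1[64], const double *D, int n_cols, int j)
  {
      for (int i = 0; i < 64; i++)
          v1[i] = D[i * n_cols + j];   /* stride = n_cols elements */
  }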
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good to support more strides at full bandwidth
  – can read a paper on how to do a prime number of banks efficiently
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number? easy if 2^N words per bank

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
      for (i = 0; i < 256; i = i+1)
          x[i][j] = 2 * x[i][j];
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / GCD(stride, #banks) < bank busy time, where #banks / GCD(stride, #banks) is the number of accesses between hits to the same bank
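The condition can be checked mechanically; a small C sketch (function names are illustrative) whose outputs match the worked example on the next slide:

  #include <stdio.h>

  static int gcd(int a, int b)
  {
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
  }

  /* the same bank is revisited every banks/GCD(stride, banks) accesses;
   * a stall occurs when that interval is shorter than the bank busy time */
  static int bank_conflict(int banks, int stride, int busy)
  {
      return banks / gcd(stride, banks) < busy;
  }

  int main(void)
  {
      printf("%d\n", bank_conflict(8, 1, 6));   /* 0: revisit every 8 >= 6 */
      printf("%d\n", bank_conflict(8, 32, 6));  /* 1: same bank every access */
      return 0;
  }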
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for an initial access
• What is the difference between a 64-element vector load with a stride of 1 and 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride:
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – the total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element
Handling Sparse Matrices in Vector Architecture
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:

  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed or gathered
Scatter-Gather
• Consider:

  for (i = 0; i < n; i=i+1)
      A[K[i]] = A[K[i]] + C[M[i]];

• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:

  LV       Vk, Rk        ; load K
  LVI      Va, (Ra+Vk)   ; load A[K[]]
  LV       Vm, Rm        ; load M
  LVI      Vc, (Rc+Vm)   ; load C[M[]]
  ADDVV.D  Va, Va, Vc    ; add them
  SVI      (Ra+Vk), Va   ; store A[K[]]

• A and C must have the same number of non-zero elements (sizes of K and M)
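In C terms, the sequence above behaves like the following sketch (the helper name and the fixed register size are illustrative; a real vector register holds at most MVL elements, so n is assumed to fit):

  /* C model of the gather/scatter sequence above */
  void sparse_add(int n, double *A, const double *C,
                  const int *K, const int *M)
  {
      double va[64], vc[64];                         /* "vector registers" */
      for (int i = 0; i < n; i++) va[i] = A[K[i]];   /* LVI Va,(Ra+Vk): gather  */
      for (int i = 0; i < n; i++) vc[i] = C[M[i]];   /* LVI Vc,(Rc+Vm): gather  */
      for (int i = 0; i < n; i++) va[i] += vc[i];    /* ADDVV.D                 */
      for (int i = 0; i < n; i++) A[K[i]] = va[i];   /* SVI (Ra+Vk),Va: scatter */
  }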
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP Benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurpose 64-bit floating point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops, four 32-bit integer/fp ops, or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
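For a concrete taste of these extensions, a minimal sketch using SSE intrinsics (the wrapper function add4 is illustrative; _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps are the standard SSE intrinsics):

  #include <xmmintrin.h>   /* SSE */

  /* add four packed single-precision floats in one SIMD instruction */
  void add4(float *c, const float *a, const float *b)
  {
      __m128 va = _mm_loadu_ps(a);            /* load 4 floats, unaligned ok */
      __m128 vb = _mm_loadu_ps(b);
      _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* c[0..3] = a[0..3] + b[0..3] */
  }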
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example DXPY:

  L.D     F0,a        ; load scalar a
  MOV     F1, F0      ; copy a into F1 for SIMD MUL
  MOV     F2, F0      ; copy a into F2 for SIMD MUL
  MOV     F3, F0      ; copy a into F3 for SIMD MUL
  DADDIU  R4,Rx,#512  ; last address to load
Loop:
  L.4D    F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
  MUL.4D  F4,F4,F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
  L.4D    F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
  ADD.4D  F8,F8,F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
  S.4D    0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
  DADDIU  Rx,Rx,#32   ; increment index to X
  DADDIU  Ry,Ry,#32   ; increment index to Y
  DSUBU   R20,R4,Rx   ; compute bound
  BNEZ    R20,Loop    ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
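The attainable performance is the lower of the two roofs:

  Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

With illustrative numbers (assumptions, not measurements from the lecture): a kernel with arithmetic intensity 0.5 FLOPs/byte on a machine with 16 GB/sec peak memory bandwidth is capped at 16 × 0.5 = 8 GFLOPs/sec, however high the peak floating-point rate.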
Examples
• Attainable GFLOPs/sec

[Figure: roofline plots of attainable GFLOPs/sec versus arithmetic intensity]
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
• Global Memory: 1280 MB
• L2 Cache: 640 KB
• 15 SMs (SM 0 ... SM 14), each with:
  – 32 cores
  – 32,768 registers
  – Shared Memory: 48 KB
  – L1 Cache: 16 KB
  – Texture Cache: 8 KB
  – Constant Cache: 8 KB
• Up to 1536 threads/SM
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
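The buffering scheme just described can be made concrete with a small kernel. This is a minimal illustrative sketch, not the code used in the lecture; the kernel name, TILE width, and launch configuration are assumptions:

  #define TILE 16   /* illustrative tile width */

  /* each thread computes one element of C; sub-tiles of A and B are
   * staged in shared memory to buffer global-memory traffic */
  __global__ void matmul(const float *A, const float *B, float *C, int n)
  {
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float sum = 0.0f;

      for (int t = 0; t < n / TILE; t++) {
          /* cooperatively load one tile of A and one tile of B */
          As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
          Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
          __syncthreads();                  /* tile fully loaded */
          for (int k = 0; k < TILE; k++)    /* partial dot product from tile */
              sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();                  /* done with this tile */
      }
      C[row * n + col] = sum;
  }

Launched, for example, as matmul<<<dim3(n/TILE, n/TILE), dim3(TILE, TILE)>>>(A, B, C, n), assuming n is a multiple of TILE (as 4096 is for TILE = 16).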
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU Device; variables declared are allocated to the GPU memory
  – __host__ => System processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(... parameter list ...)
  – blockIdx: block identifier
  – threadIdx: threads per block identifier
  – blockDim: threads per block
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Chainingbull Vector version of register bypassing
ndash Allows a vector operation to start as soon as the individual elements of its vector source operand become available
Memory
V1
Load Unit
Mult
V2 V3
Chain
Add
V4 V5
Chain
LV v1MULV v3 v1v2ADDV v5 v3 v4
CA-Lec8 cwliutwinseenctuedutw 25
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
LoadMul
Add
LoadMul
AddTime
bull Without chaining must wait for last element of result to be written before starting dependent instruction
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip-mined vector loop with vector length of 32
– Block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors (see the sketch below)
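A minimal CUDA sketch of this decomposition (the kernel and array names are assumed for illustration, and 8192 is a multiple of the block size, so no bounds check is needed):

__global__ void vecmul(const double *a, const double *b, double *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global element index */
    c[i] = a[i] * b[i];                             /* one thread per element */
}
/* Host-side launch: 8192 elements / 512 threads per block = 16 blocks (the grid):
   vecmul<<<16, 512>>>(a, b, c);  */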
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
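A quick consistency check on the numbers above: 16 lanes × 2048 registers = 32,768 registers per SIMD processor, and one SIMD thread's maximum allocation of 64 vector registers of 32 32-bit elements is 64 × 32 = 2048 values, i.e. 128 registers in each of the 16 lanes.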
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU (memory hierarchy, summarized from the original figure):
– Global memory: 1280 MB; L2 cache: 640 KB
– Per SM (SM 0 … SM 14): 32 cores, 32,768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, 8 KB constant cache
– Up to 1536 threads per SM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
– Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C (see the sketch below)
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
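A minimal CUDA sketch of the one-thread-per-cell scheme (kernel and parameter names assumed; the shared-memory tiling mentioned above is omitted for brevity):

__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  /* this thread's row of C */
    int col = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's column of C */
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)   /* dot product: row of A with column of B */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;       /* each thread writes one element of C */
    }
}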
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
– __host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(... parameter list ...)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: number of threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
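Note the launch arithmetic: nblocks = (n + 255)/256 rounds up, so every element gets a thread even when n is not a multiple of 256; the if (i < n) guard then masks off the surplus threads in the last block.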
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9     ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx   ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4       ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2       ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0         ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e. which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– …and when paths converge
• Acts as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
CA-Lec8 cwliutwinseenctuedutw
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2       ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
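A small CUDA sketch of the three memory spaces named above (kernel and buffer names assumed):

__global__ void scale(float *gmem) {        /* gmem: GPU memory, host-visible */
    __shared__ float buf[256];              /* local memory, shared within a block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = gmem[i];             /* stage global data in shared memory */
    __syncthreads();                        /* barrier across the block's threads */
    float priv = buf[threadIdx.x] * 2.0f;   /* priv: per-thread private (register) */
    gmem[i] = priv;
}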
CA-Lec8 cwliutwinseenctuedutw
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
– Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
– No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2, and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
}
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e. loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
– A one-dimensional array index is affine if it can be written in the form a×i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies: Example
• Assume:
– Load an array element with index c×i + d, and store to index a×i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a×j + b = c×k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive, but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide |d−b|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0; i<100; i=i+1) {
    X[2*i+3] = X[2*i] * 5.0;
}
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2; |d−b| = 3
3. Since 2 does not divide 3, no dependence is possible
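A throwaway C sketch of the test (helper names assumed, and the coefficients a and c assumed nonzero), applied to the example above:

/* GCD of two positive integers. */
static int gcd(int x, int y) { while (y != 0) { int t = x % y; x = y; y = t; } return x; }

/* GCD test for a store to a*i+b and a load from c*i+d: a loop-carried
   dependence is only possible if GCD(c,a) divides |d-b|.
   Returns 1 if a dependence is possible, 0 if it is ruled out. */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    return diff % g == 0;
}
/* gcd_test(2, 3, 2, 0) returns 0: no dependence possible, as derived above. */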
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test indicates a possible dependence but no dependence exists; for instance, a store to X[i+100] and a load from X[i] pass the GCD test, yet with 0 ≤ i < 50 the two never touch the same element
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]) and S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion (scalar ===> vector); it is parallel.
The second loop is called a reduction; it sums up the elements of the vector and is not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
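The ten per-processor partial sums must still be combined; a hedged C sketch of the usual recursive-halving finish (loop bounds written for exactly ten partial sums):

/* Combine finalsum[0..9] in halving steps: strides 8, 4, 2, 1. */
for (int stride = 8; stride >= 1; stride /= 2)
    for (int p = 0; p < stride && p + stride < 10; p++)
        finalsum[p] = finalsum[p] + finalsum[p + stride];
/* finalsum[0] now holds the grand total. */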
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step up in performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on big stalls vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading built on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Chaining Advantage
CA-Lec8 cwliutwinseenctuedutw 26
bull With chaining can start dependent instruction as soon as first result appears
[Timing diagram: two dependent convoys (Load/Mul, then Add). Without chaining, the Add waits for the entire Mul result vector; with chaining, the Add starts as soon as the first Mul result appears.]
bull Without chaining must wait for last element of result to be written before starting dependent instruction
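As a rough worked instance (start-up latencies assumed purely for illustration): for 64-element vectors with a 7-cycle multiply start-up and a 6-cycle add start-up, the unchained pair takes about (7 + 64) + (6 + 64) = 141 clock cycles, while chaining overlaps them into roughly 7 + 6 + 64 = 77 cycles.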
Chimes
• Chime: unit of time to execute one convoy
– m convoys execute in m chimes
– For vector length of n, requires m × n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
Example
LV       V1,Rx     ;load vector X
MULVS.D  V2,V1,F0  ;vector-scalar multiply
LV       V3,Ry     ;load vector Y
ADDVV.D  V4,V2,V3  ;add two vectors
SV       Ry,V4     ;store the sum

Convoys:
1. LV, MULVS.D
2. LV, ADDVV.D
3. SV

3 chimes; 2 FP ops per result; cycles per FLOP = 1.5
For 64-element vectors, requires 64 × 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Strip‐mining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers ("strip-mining").

for (i=0; i<N; i++)
    C[i] = A[i] + B[i];

      ANDI   R1, N, 63    ; N mod 64
      MTC1   VLR, R1      ; Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    ; Multiply by 8
      DADDU  RA, RA, R2   ; Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     ; Subtract elements
      LI     R1, 64
      MTC1   VLR, R1      ; Reset full length
      BGTZ   N, loop      ; Any more to do?

[Figure: arrays A, B, C processed as one odd-size remainder piece followed by full 64-element pieces]
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use the Vector Length Register (VLR) for vectors over the maximum length, strip-mining:

low = 0;
VL = (n % MVL);                         /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {      /* outer loop */
    for (i = low; i < (low+VL); i=i+1)  /* runs for length VL */
        Y[i] = a * X[i] + Y[i];         /* main operation */
    low = low + VL;                     /* start of next vector */
    VL = MVL;                           /* reset the length to maximum vector length */
}
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
– Simple implementation: execute all N operations, turn off result writeback according to the mask
– Density-time implementation: scan the mask vector and only execute elements with non-zero masks
[Figure: masked execution over eight elements with mask M[0..7] = 0,1,0,0,1,1,0,1; only C[1], C[4], C[5], and C[7] are written]
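In C terms, the simple implementation behaves like this sketch (array and mask names assumed for illustration):

/* Masked vector operation: every lane computes, the mask gates writeback. */
for (int i = 0; i < N; i++) {
    double t = A[i] - B[i];   /* all N elements are computed */
    if (M[i]) C[i] = t;       /* write back only where the mask bit is 1 */
}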
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• VMR is part of the architectural state
• A vector processor relies on compilers to manipulate VMR explicitly
• A GPU gets the same effect using hardware
– Invisible to SW
• Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in a vector processor without masking
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV       V1,Rx     ;load vector X into V1
LV       V2,Ry     ;load vector Y
L.D      F0,#0     ;load FP zero into F0
SNEVS.D  V1,F0     ;sets VM(i) to 1 if V1(i) != F0
SUBVV.D  V1,V1,V2  ;subtract under vector mask
SV       Rx,V1     ;store the result in X
• GFLOPS rate decreases (time is still spent on masked-off elements)
CA-Lec8 cwliutwinseenctuedutw 34
Compress/Expand Operations
• Compress packs non-masked elements from one vector register contiguously at the start of the destination vector register
– The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] contiguously; expand scatters them back into the masked positions]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start-ups on load/store units are higher than those for arithmetic units
• Memory system must be designed to support high bandwidth for vector loads and stores
– Spread accesses across multiple banks:
1. Support multiple loads or stores per clock. Be able to control the addresses to the banks independently
2. Support (multiple) non-sequential loads or stores
3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. SRAM takes 15 / 2.167 = 6.92 ≈ 7 processor cycles
3. It requires 192 × 7 = 1344 memory banks at full memory bandwidth!
CA-Lec8 cwliutwinseenctuedutw 37
The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth).
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– Unit stride
• Contiguous block of information in memory
• Fastest: always possible to optimize this
– Non-unit (constant) stride
• Harder to optimize memory system for all possible strides
• Prime number of data banks makes it easier to support different strides at full bandwidth
– Indexed (gather-scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
– Contiguous elements in different DRAMs
– Startup time for vector operation is latency of single read
• What about non-unit stride?
– Above good for strides that are relatively prime to 8
– Bad for strides of 2, 4
[Figure: vector processor fed by 8 unpipelined DRAM banks; bank k holds the words with Addr mod 8 = k]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into the register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Cache inherently deals with unit stride data
– Blocking techniques help non-unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride?
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency to avoid stalls
• If desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– Can read a paper on how to do a prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– alternative: address within bank = address mod number of words in bank
– bank number is easy if 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];  /* column walk: a stride of 512 words repeatedly hits the same bank */
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles to initialize
• What is the difference between a 64-element vector load with a stride of 1 and of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
– Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector Architecture
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV       Vk, Rk       ;load K
LVI      Va, (Ra+Vk)  ;load A[K[]]
LV       Vm, Rm       ;load M
LVI      Vc, (Rc+Vm)  ;load C[M[]]
ADDVV.D  Va, Va, Vc   ;add them
SVI      (Ra+Vk), Va  ;store A[K[]]
• A and C must have the same number of non-zero elements (sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, energy efficient, and better real-time model than out-of-order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is dependency:
• One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., …
– use in drivers or added to library routines; no compiler
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
• Intel MMX (1996)
– Repurpose 64-bit floating point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations (see the intrinsics sketch below)
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Chimes
bull Chimes unit of time to execute one convoyndash m convoy executes in m chimesndash For vector length of n requires m n clock cycles
CA-Lec8 cwliutwinseenctuedutw 27
ExampleLV V1Rx load vector XMULVSD V2V1F0 vector‐scalar multiplyLV V3Ry load vector YADDVVD V4V2V3 add two vectorsSV RyV4 store the sum
Convoys1 LV MULVSD2 LV ADDVVD3 SV
3 chimes 2 FP ops per result cycles per FLOP = 15For 64 element vectors requires 64 x 3 = 192 clock cycles
CA-Lec8 cwliutwinseenctuedutw
Vector Architectures
28
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks: [table of vectorization results omitted]
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually a 64-bit register split into 2×32b, 4×16b, or 8×8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (the first additions since the 386)
– similar to Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, packed in 64 bits
– reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claimed overall speedup: 1.5 to 2× for 2D/3D graphics, audio, video, speech, comm., ...
– used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (sets to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– repurposed the 64-bit floating-point registers
– eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– separate 128-bit registers
– eight 16-bit integer ops; four 32-bit integer/FP ops or two 64-bit integer/FP ops
– single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– double-precision floating-point arithmetic
• Advanced Vector Extensions (AVX, 2010)
– 256-bit registers
– four 64-bit integer/FP ops
– extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996–1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
– data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– integrates FPU and VMX into one unit with 64 registers
• QPX (2010, Blue Gene)
– four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than a full vector architecture
• Data transfers are separate and aligned in memory
– vector contrast: a single instruction may make 64 memory accesses, so a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DXPY (Y = a × X + Y)

      LD     F0, a         ; load scalar a
      MOV    F1, F0        ; copy a into F1 for SIMD MUL
      MOV    F2, F0        ; copy a into F2 for SIMD MUL
      MOV    F3, F0        ; copy a into F3 for SIMD MUL
      DADDIU R4, Rx, 512   ; last address to load
Loop: L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
      L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
      S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx, Rx, 32    ; increment index to X
      DADDIU Ry, Ry, 32    ; increment index to Y
      DSUBU  R20, R4, Rx   ; compute bound
      BNEZ   R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
– only accesses contiguous data
• No efficient scatter/gather accesses
– significant penalty for unaligned memory access
– may need to write an entire vector register
• Limitations on data access patterns
– limited by the cache line, up to 128–256b
• Conditional execution
– register renaming does not work well with masked execution
– always need to write the whole register
– difficult to know when to signal exceptions
• Register pressure
– need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
– plot peak floating-point throughput as a function of arithmetic intensity
– ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity:
– floating-point operations per byte read
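The bound the roofline plots can be stated as an equation (this is the standard form of the model; the slide itself only shows the plots):

Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)

Below the ridge point the machine is memory-bound; above it, compute-bound.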
Examples
• Attainable GFLOPs/sec: [roofline plots for sample machines omitted]
History of GPUs
• Early video cards
– frame buffer memory with address generation for video output
• 3D graphics processing
– originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– processors oriented to 3D graphics tasks
– vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
– heterogeneous execution model
• the CPU is the host, the GPU is the device
– develop a C-like programming language for the GPU
– unify all forms of GPU parallelism as the CUDA thread
– programming model is "Single Instruction, Multiple Thread" (SIMT)
Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
– works well with data-level parallel problems
– scatter-gather transfers from memory into local store
– mask registers
– large register files
• Differences:
– no scalar processor
– uses multithreading to hide memory latency
– has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model:
– Single Instruction, Multiple Thread (SIMT)
• a thread is associated with each data element
• threads are organized into blocks
• blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– the code that works over all elements is the grid
– thread blocks break this down into manageable sizes
• 512 threads per block
– a SIMD instruction executes 32 elements at a time
– thus, grid size = 16 blocks
– a block is analogous to a strip-mined vector loop with vector length 32
– a block is assigned to a multithreaded SIMD processor by the thread block scheduler
– current-generation GPUs (Fermi) have 7–15 multithreaded SIMD processors
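A hedged CUDA sketch of this configuration (the kernel name and signature are assumptions, not the lecture's code): 8192 elements at 512 threads per block gives a grid of 8192/512 = 16 blocks.

__global__ void mulvec(int n, const double *x, const double *y, double *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) out[i] = x[i] * y[i];
}
// launch: mulvec<<<16, 512>>>(8192, x, y, out);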
Terminology
• Threads of SIMD instructions
– each has its own PC
– the thread scheduler uses a scoreboard to dispatch
– no data dependences between threads!
– keeps track of up to 48 threads of SIMD instructions
• hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– wide and shallow compared to vector processors
Example
• The NVIDIA GPU has 32,768 registers
– divided among the lanes
– each SIMD thread is limited to 64 registers
– a SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Block diagram; the recoverable figures:]
• Global memory: 1280 MB
• 15 SMs (SM 0 – SM 14), each with: 32 cores, 32,768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, 8 KB constant cache
• 640 KB shared L2 cache
• Up to 1536 threads per SM
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– memory access optimization, latency hiding
Matrix Multiplication
[figure omitted]
Programming the GPU
[code listing figure omitted]
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
– matrix C requires calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570: 15 SMs × 1536 threads/SM), which means that many matrix cells can be calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory, to act as a buffer and take advantage of GPU bandwidth
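A minimal CUDA sketch of the scheme just described (the tile size, the names, and the assumption n % TILE == 0 are mine; this is not the lecture's exact code): each thread computes one cell of C, staging TILE×TILE sub-matrices of A and B in shared memory.

#define TILE 16

__global__ void matmul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE], Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < n / TILE; t++) {          // walk tiles along the k dimension
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                           // tile fully loaded by the block
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                           // done with this tile
    }
    C[row * n + col] = sum;
}
// launch: dim3 block(TILE, TILE), grid(n/TILE, n/TILE); matmul<<<grid, block>>>(A, B, C, n);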
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU (device); variables declared in them are allocated to GPU memory
– __host__ => system processor (host)
• Function call: name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CUDA Program Example

// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per thread block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (a kernel, hence __global__)
__global__
void daxpy(int n, double a, double *x, double *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (the DAXPY body):

shl.s32        R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; product in RD0 = RD0 * RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– a branch synchronization stack
• entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– instruction markers to manage when a branch diverges into multiple execution paths
• push on a divergent branch
– ...and when paths converge
• act as barriers
• pop the stack
• A per-thread-lane 1-bit predicate register, specified by the programmer
Example

if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

      ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
      setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
      @!P1, bra ELSE1, *Push        ; push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
      ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
      sub.f64        RD0, RD0, RD2  ; difference in RD0
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
      @P1, bra ENDIF1, *Comp        ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "private memory"
– contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
– shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU memory
– the host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
[block diagram omitted]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism means the loop has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by themselves in the previous iteration (A[i] and B[i])
– loop-carried dependences
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
– a sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1–S2 intra-loop dependence:
– circular: S1 depends on S2 and S2 depends on S1
– not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– the absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];      /* S1 */
  B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– the two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize
Recurrence
• A recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important:
– some vector computers have special support for executing recurrences
– it may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact" analysis) and eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
– a one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– the index of a multi-dimensional array is affine if the index in each dimension is affine
– non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.
Finding Dependences: Example
• Assume:
– a store to an array element with index a × i + b, and a load from an element with index c × i + d
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. there exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• if a loop-carried dependence exists, then GCD(c, a) must divide (d − b), i.e., GCD(c, a) | |d − b|
Example

for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;

• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2, |b − d| = 3
3. Since 2 does not divide 3, no dependence is possible.
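A small C sketch of the GCD test (the function names are mine); on the example above it returns 0, i.e., no dependence is possible:

#include <stdlib.h>

static long gcd(long x, long y) { return y ? gcd(y, x % y) : x; }

/* 1 = a loop-carried dependence MAY exist (the test is conservative),
 * 0 = no dependence is possible. Store index a*i+b, load index c*i+d. */
int dependence_possible(long a, long b, long c, long d) {
    return labs(d - b) % gcd(c, a) == 0;
}
/* dependence_possible(2, 3, 2, 0) == 0, matching the solution above. */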
Remarks
• The GCD test is sufficient but not necessary
– the GCD test does not consider the loop bounds
– there are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences: Example 2
• Example 2:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1 -> S3 and S1 -> S4 (both on Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product:
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
/* "scalar expansion": scalar ==> vector; parallel */
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
/* a "reduction": sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures
– similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
– assume p ranges from 0 to 9
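A C sketch of the whole scheme (the final combine step across the ten partial sums is my addition; the slide stops at the per-processor loops):

double reduce(const double sum[10000]) {
    double finalsum[10] = {0};
    for (int p = 0; p < 10; p++)          /* each p would run on its own processor */
        for (int i = 999; i >= 0; i--)
            finalsum[p] += sum[i + 1000 * p];
    double total = 0;
    for (int p = 0; p < 10; p++)          /* sequential combine of the 10 partials */
        total += finalsum[p];
    return total;
}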
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading built on an OOO superscalar microarchitecture
– instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– if code is vectorizable, then simpler hardware, more energy-efficient, and a better real-time model than OOO machines
– design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Example

LV      V1, Rx      ; load vector X
MULVS.D V2, V1, F0  ; vector-scalar multiply
LV      V3, Ry      ; load vector Y
ADDVV.D V4, V2, V3  ; add two vectors
SV      Ry, V4      ; store the sum

Convoys:
1. LV + MULVS.D
2. LV + ADDVV.D
3. SV

3 chimes, 2 FP ops per result, so cycles per FLOP = 1.5.
For 64-element vectors, this requires 64 × 3 = 192 clock cycles.
Vector Strip-Mining
Problem: vector registers have finite length.
Solution: break loops into pieces that fit into vector registers ("strip-mining").

for (i=0; i<N; i++)
  C[i] = A[i] + B[i];

      ANDI   R1, N, 63    ; N mod 64
      MTC1   VLR, R1      ; do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3    ; multiply by 8
      DADDU  RA, RA, R2   ; bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1     ; subtract elements
      LI     R1, 64
      MTC1   VLR, R1      ; reset full length
      BGTZ   N, loop      ; any more to do?

[figure: A, B, C each split into an odd-size remainder piece plus full 64-element pieces]
Vector Length Register
• How to handle loops whose length is not 64, or not known at compile time? Use the Vector Length Register (VLR), and strip-mine vectors over the maximum length:

low = 0;
VL = (n % MVL);                      /* find odd-size piece using modulo op */
for (j = 0; j <= (n/MVL); j=j+1) {   /* outer loop */
  for (i = low; i < (low+VL); i=i+1) /* runs for length VL */
    Y[i] = a * X[i] + Y[i];          /* main operation */
  low = low + VL;                    /* start of next vector */
  VL = MVL;                          /* reset the length to maximum vector length */
}
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– utilizes the full power of the vector processor
• Later generations may grow the MVL
– no need to change the ISA
Vector-Mask Control
• Simple implementation:
– execute all N operations; turn off result writeback according to the mask
• Density-time implementation:
– scan the mask vector and only execute elements with non-zero masks
[figures: a masked vector operation on A and B writing C, with mask M[0..7] = 0,1,0,0,1,1,0,1, shown for both implementations; the simple one disables the write data port per element, the density-time one skips masked-off elements entirely]
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
– invisible to SW
• Both GPUs and vector processors spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor.
• Consider:
for (i = 0; i < 64; i=i+1)
  if (X[i] != 0)
    X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:

LV      V1, Rx     ; load vector X into V1
LV      V2, Ry     ; load vector Y
L.D     F0, #0     ; load FP zero into F0
SNEVS.D V1, F0     ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D V1, V1, V2 ; subtract under vector mask
SV      Rx, V1     ; store the result in X
• The GFLOPS rate decreases (masked-off elements still consume time)
Compress/Expand Operations
• Compress packs the non-masked elements from one vector register contiguously at the start of the destination vector register
– the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[figure: with mask M[0..7] = 0,1,0,0,1,1,0,1, compress packs A[1], A[4], A[5], A[7] to the front; expand scatters them back to positions 1, 4, 5, 7, leaving the B values where M = 0]
• Used for density-time conditionals and also for general selection operations
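What compress computes, as a C sketch (the names are mine):

/* Pack elements of a[] whose mask bit is set contiguously into d[];
 * the return value is the population count of the mask, i.e., the
 * packed vector length. Expand is the inverse loop.               */
int compress(const double *a, const int *mask, double *d, int n) {
    int len = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) d[len++] = a[i];
    return len;
}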
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
– memory stalls can reduce effective throughput for the rest of the vector
• Start-up penalties on load/store units are higher than those on arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
– spread accesses across multiple banks:
1. support multiple loads or stores per clock; be able to control the addresses to the banks independently
2. support (multiple) non-sequential loads or stores
3. support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?

• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192.
2. One SRAM access takes 15 / 2.167 ≈ 7 processor cycles.
3. Full memory bandwidth therefore requires 192 × 7 = 1344 memory banks!

(The Cray T932 actually has 1024 memory banks, so it cannot sustain full bandwidth.)
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
– unit stride
• contiguous block of information in memory
• fastest; always possible to optimize this
– non-unit (constant) stride
• harder to optimize the memory system for all possible strides
• a prime number of data banks makes it easier to support different strides at full bandwidth
– indexed (gather-scatter)
• vector equivalent of register indirect
• good for sparse arrays of data
• increases the number of programs that vectorize
Interleaved Memory Layout
[figure: a vector processor connected to 8 unpipelined DRAM banks; bank i holds the words whose address mod 8 = i]
• Great for unit stride:
– contiguous elements are in different DRAMs
– start-up time for a vector operation is the latency of a single read
• What about non-unit stride?
– the layout above is good for strides that are relatively prime to 8
– bad for strides of 2, 4
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
• Must vectorize the multiplications of rows of B with columns of D
– need to access adjacent elements of B and adjacent elements of D
– elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating the elements to be gathered into a single register is called the stride.
• In the example:
– matrix D has a stride of 100 double words
– matrix B has a stride of 1 double word
• Use a non-unit stride for D
– to access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
– use LVWS/SVWS: load/store vector with stride
• the vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– blocking techniques help with non-unit-stride data
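A C sketch of the strided access LVWS performs here (the array shape and names are assumptions): loading one column of D touches consecutive elements 100 double words apart.

/* What LVWS V1,(R1,R2) does with R1 = &D[0][j] and a stride R2 of
 * 100 double words (800 bytes); V1 stands in for a vector register. */
void load_column(double V1[64], double D[100][100], int j) {
    for (int i = 0; i < 64; i++)
        V1[i] = D[i][j];   /* address = &D[0][j] + i * 100 dwords */
}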
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Strip‐mining
ANDI R1 N 63 N mod 64MTC1 VLR R1 Do remainderloopLV V1 RADSLL R2 R1 3 Multiply by 8 DADDU RA RA R2 Bump pointerLV V2 RBDADDU RB RB R2 ADDVD V3 V1 V2SV V3 RCDADDU RC RC R2DSUBU N N R1 Subtract elementsLI R1 64MTC1 VLR R1 Reset full lengthBGTZ N loop Any more to do
for (i=0 iltN i++)C[i] = A[i]+B[i]
+
+
+
A B C
64 elements
Remainder
Problem Vector registers have finite lengthSolution Break loops into pieces that fit into vector registers ldquoStrip‐miningrdquo
CA-Lec8 cwliutwinseenctuedutw 29
Vector Length Register
bull Handling loops not equal to 64bull Vector length not known at compile time Use Vector Length
Register (VLR) for vectors over the maximum length strip mininglow = 0VL = (n MVL) find odd‐size piece using modulo op for (j = 0 j lt= (nMVL) j=j+1) outer loop
for (i = low i lt (low+VL) i=i+1) runs for length VLY[i] = a X[i] + Y[i] main operation
low = low + VL start of next vectorVL = MVL reset the length to maximum vector length
CA-Lec8 cwliutwinseenctuedutw 30
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
– Contiguous elements go to different DRAM banks
– Start-up time for a vector operation is the latency of a single read
• What about non-unit stride?
– The above is good for strides that are relatively prime to 8
– Bad for strides of 2, 4, ...
[Figure: a vector processor fed by 8 unpipelined DRAM banks; a word at address Addr is placed in the bank given by Addr mod 8 (banks 0 through 7)]
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 100; j = j + 1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k = k + 1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize the multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non-unit stride for D
– To access non-sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
– Blocking techniques help with non-unit-stride data
How to get full bandwidth for unit stride?
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs. Only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
– (one can read the literature on how to support a prime number of banks efficiently)
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j + 1)
    for (i = 0; i < 256; i = i + 1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, there is a conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding", sketched below)
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks?
– address within bank = address mod number of words in bank
– bank number is easy if 2^N words per bank
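A sketch of the array-padding fix for the loop above (assuming 128 banks and word interleaving): one extra column makes the row stride 513 words, which is relatively prime to 128, so walking down a column no longer hits the same bank every access.

    int x[256][513];                      /* padded from x[256][512]          */
    for (int j = 0; j < 512; j = j + 1)   /* logical width is still 512       */
        for (int i = 0; i < 256; i = i + 1)
            x[i][j] = 2 * x[i][j];        /* inner-loop stride is now 513     */
                                          /* words; 513 mod 128 = 1, so       */
                                          /* successive accesses rotate banks */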
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– #banks / LCM(stride, #banks) < bank busy time
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, this is the worst possible stride
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– the total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
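A small C simulator of the slide's model (a sketch under the stated assumptions: one request may issue per cycle, a bank stays busy for BUSY cycles after a request, data returns LATENCY cycles after issue); it reproduces the 76 and 391 cycle counts above:

    #include <stdio.h>

    int vector_load_cycles(int n, int stride, int banks, int busy, int latency) {
        int bank_free[64] = {0};          /* cycle at which each bank is free again */
        int issue = 0, finish = 0;
        for (int i = 0; i < n; i++) {
            int b = (i * stride) % banks;
            if (issue < bank_free[b]) issue = bank_free[b]; /* stall on conflict */
            bank_free[b] = issue + busy;
            finish = issue + latency;     /* cycle at which this element arrives  */
            issue++;                      /* next request no earlier than next cycle */
        }
        return finish + 1;                /* cycles 0..finish inclusive           */
    }

    int main(void) {
        printf("stride 1:  %d cycles\n", vector_load_cycles(64, 1, 8, 6, 12));  /* 76  */
        printf("stride 32: %d cycles\n", vector_load_cycles(64, 32, 8, 6, 12)); /* 391 */
        return 0;
    }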
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i = i + 1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed (gather/scatter)
Scatter-Gather
• Consider:
for (i = 0; i < n; i = i + 1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of non-zero elements (the sizes of K and M)
Vector Architecture: Summary
• Vector is an alternative model for exploiting ILP
– If the code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order execution
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is a dependency:
• One stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [table not reproduced]
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep the multiply/add/load units busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
– similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements in 64 bits
– reuses the 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
– used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– optional signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
– sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b <-> 16b, 16b <-> 8b
– Pack saturates (sets to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposed the 64-bit floating-point registers
– Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128-bit registers
– Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
– Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
– 256-bit registers
– Four 64-bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, or 4 32b integer ops, and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates the FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers, aligned in memory
– Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely!
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DXPY (Y = a × X + Y)
L.D     F0, a         ; load scalar a
MOV     F1, F0        ; copy a into F1 for SIMD MUL
MOV     F2, F0        ; copy a into F2 for SIMD MUL
MOV     F3, F0        ; copy a into F3 for SIMD MUL
DADDIU  R4, Rx, #512  ; last address to load
Loop:
L.4D    F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D  F4, F4, F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D    F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D  F8, F8, F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D    0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU  Rx, Rx, #32   ; increment index to X
DADDIU  Ry, Ry, #32   ; increment index to Y
DSUBU   R20, R4, Rx   ; compute bound
BNEZ    R20, Loop     ; check if done
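For comparison, a sketch of the same idea with real x86 SSE intrinsics (4-wide single precision; the function name dxpy_sse is illustrative, and n is assumed to be a multiple of 4):

    #include <xmmintrin.h>

    void dxpy_sse(int n, float a, const float *x, float *y) {
        __m128 va = _mm_set1_ps(a);               /* broadcast a into 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);      /* load X[i..i+3]           */
            __m128 vy = _mm_loadu_ps(y + i);      /* load Y[i..i+3]           */
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy); /* a*X + Y               */
            _mm_storeu_ps(y + i, vy);             /* store Y[i..i+3]          */
        }
    }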
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by the cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
[Roofline plots not reproduced]
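The roofline bound restated as a one-liner in C (a sketch; parameter names are illustrative):

    #include <math.h>

    double attainable_gflops(double peak_gflops,    /* peak FP throughput        */
                             double peak_bw_gbs,    /* peak memory BW in GB/s    */
                             double ai_flops_per_b) /* arithmetic intensity      */
    {
        /* Below the ridge point the machine is bandwidth-bound,
           above it compute-bound. */
        return fmin(peak_gflops, peak_bw_gbs * ai_flops_per_b);
    }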
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
– extend a standard sequential programming language, specifically C/C++,
• to focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– A SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– A block is analogous to a strip-mined vector loop with a vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
(The launch-configuration arithmetic is sketched below.)
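Host-side arithmetic for this example (a sketch; the kernel name vecmul is hypothetical):

    int n       = 8192;
    int threads = 512;                          /* threads per thread block  */
    int blocks  = (n + threads - 1) / threads;  /* = 16 blocks in the grid   */
    /* vecmul<<<blocks, threads>>>(n, x, y, z);                              */
    /* Each SIMD instruction covers 32 elements, so a block of 512 threads   */
    /* corresponds to 512 / 32 = 16 SIMD instructions per vector operation.  */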
Terminology
• Threads of SIMD instructions
– Each has its own PC
– The thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
Example
• An NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: memory hierarchy – 1280MB global memory and 640KB L2 cache shared by 15 SMs (SM 0 to SM 14); each SM has an 8KB texture cache, an 8KB constant cache, 48KB of shared memory, a 16KB L1 cache, 32,768 registers, and 32 cores]
GPU Threads in SM (GTX570)
• Up to 1536 threads per SM
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
Matrix Multiplication
[Figure: matrix multiplication example not reproduced]
Programming the GPU
[Code figure not reproduced]
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
(A tiled CUDA kernel along these lines is sketched below.)
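A tiled matrix-multiply kernel in CUDA, a sketch of the scheme just described (not the lecture's exact code): each thread computes one element of C, and each block stages TILE x TILE sub-matrices of A and B in shared memory. N is assumed to be a multiple of TILE.

    #define TILE 16

    __global__ void matmul(const float *A, const float *B, float *C, int N) {
        __shared__ float As[TILE][TILE];   // sub-matrix of A buffered on chip
        __shared__ float Bs[TILE][TILE];   // sub-matrix of B buffered on chip
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;
        for (int t = 0; t < N / TILE; t++) {
            // each thread loads one element of each tile from global memory
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();               // tile fully loaded before use
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               // done with tile before overwriting
        }
        C[row * N + col] = sum;
    }

    // Hypothetical host-side launch:
    //   dim3 grid(N / TILE, N / TILE), block(TILE, TILE);
    //   matmul<<<grid, block>>>(A, B, C, N);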
Programming the GPU
• Distinguishing the execution place of functions:
– __device__ or __global__ => GPU (device); variables declared there are allocated to GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: number of threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<<...>>> is declared __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2       ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
st.global.f64 [X+R8], RD0         ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
– The host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Processor
[Figure not reproduced]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires that there be no loop-carried dependence
• Example 1:
for (i = 999; i >= 0; i = i - 1)
    x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by themselves in the previous iteration (A[i] and B[i])
– Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence
Remarks
• An intra-loop dependence is not a loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i = 0; i < 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
    B[i+1] = C[i] + D[i];        /* S2 */
    A[i+1] = A[i+1] + B[i+1];    /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
for (i = 0; i < 100; i = i + 1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i = 1; i < 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume:
– A load of an array element with index c × i + d, and a store to index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are two iteration indices j and k such that m <= j <= n and m <= k <= n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c, a) must divide |d - b|
Example
for (i = 0; i < 100; i = i + 1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a = 2, b = 3, c = 2, and d = 0
2. GCD(a, c) = 2, |b - d| = 3
3. Since 2 does not divide 3, no dependence is possible
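The GCD test as a small C program (a sketch; may_depend returning 1 means a dependence is merely possible, not proven):

    #include <stdio.h>
    #include <stdlib.h>

    int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    /* Store to X[a*i+b], load from X[c*i+d]: if gcd(a, c) does not
       divide |d - b|, no loop-carried dependence is possible. */
    int may_depend(int a, int b, int c, int d) {
        return (abs(d - b) % gcd(a, c)) == 0;
    }

    int main(void) {
        /* X[2i+3] = X[2i] * 5.0: a=2, b=3, c=2, d=0 -> gcd 2, |d-b| 3 */
        printf("%s\n", may_depend(2, 3, 2, 0) ? "possible" : "no dependence");
        return 0;
    }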
Remarks
• A failing GCD test is sufficient to guarantee that no dependence exists; a passing test does not prove one
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies
• Example 2:
for (i = 0; i < 100; i = i + 1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 (Y[i]) and S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i = 0; i < 100; i = i + 1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i = 0; i < 100; i = i + 1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i = 9999; i >= 0; i = i - 1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to:
for (i = 9999; i >= 0; i = i - 1)
    sum[i] = x[i] * y[i];
– This is called scalar expansion: scalar ===> vector; parallel
for (i = 9999; i >= 0; i = i - 1)
    finalsum = finalsum + sum[i];
– This is called a reduction: sums up the elements of the vector; not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i = 999; i >= 0; i = i - 1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];
– Assume p ranges from 0 to 9
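The remaining step, combining the ten partial sums, is a sketch like the following; since it is only ten iterations, doing it serially costs little:

    /* Final combine after the per-processor partial sums (illustrative). */
    double total = 0.0;
    for (int p = 0; p < 10; p++)
        total = total + finalsum[p];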
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
– If the code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Vector Length Register
• Handles loops whose length is not equal to the maximum vector length (MVL)
• When the vector length is not known at compile time, use the Vector Length Register (VLR) and strip mining for vectors over the maximum length:
low = 0;
VL = (n % MVL);                          /* find odd-size piece using modulo op */
for (j = 0; j <= (n / MVL); j = j + 1) { /* outer loop                          */
    for (i = low; i < (low + VL); i = i + 1) /* runs for length VL              */
        Y[i] = a * X[i] + Y[i];          /* main operation                      */
    low = low + VL;                      /* start of next vector                */
    VL = MVL;                            /* reset the length to the maximum     */
}
Maximum Vector Length (MVL)
• Determines the maximum number of elements in a vector for a given architecture
• All blocks but the first are of length MVL
– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA
Vector-Mask Control
[Figure: two implementations of vector-mask control, shown for mask bits M[1]=M[4]=M[5]=M[7]=1]
• Simple implementation – execute all N operations, turn off result writeback according to the mask
• Density-time implementation – scan the mask vector and execute only the elements with non-zero masks
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Maximum Vector Length (MVL)
bull Determine the maximum number of elements in a vector for a given architecture
bull All blocks but the first are of length MVLndash Utilize the full power of the vector processor
bull Later generations may grow the MVLndash No need to change the ISA
CA-Lec8 cwliutwinseenctuedutw 31
Vector‐Mask Control
• Simple implementation: execute all N operations, and turn off the result writeback according to the mask
• Density‐time implementation: scan the mask vector and only execute elements with non‐zero masks
[Figure: both implementations computing C[i] = A[i] + B[i] under mask M[7:0], with a write‐disable on the write data port in the simple case]
CA-Lec8 cwliutwinseenctuedutw 32
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot be run in a vector processor without special support
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements:
LV       V1,Rx       ; load vector X into V1
LV       V2,Ry       ; load vector Y
L.D      F0,#0       ; load FP zero into F0
SNEVS.D  V1,F0       ; sets VM(i) to 1 if V1(i)!=F0
SUBVV.D  V1,V1,V2    ; subtract under vector mask
SV       Rx,V1       ; store the result in X
• GFLOPS rate decreases, since masked‐off elements still occupy execution slots
CA-Lec8 cwliutwinseenctuedutw 34
Compress/Expand Operations
• Compress packs non‐masked elements from one vector register contiguously at the start of the destination vector register
  – The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation (see the C sketch below)
[Figure: with mask bits M[1], M[4], M[5], M[7] set, compress gathers A[1], A[4], A[5], A[7] to the front of the destination register; expand scatters them back to those positions, leaving the B values elsewhere]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
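A scalar C sketch of the compress/expand semantics shown in the figure (names A, B, M, R follow the figure; the functions are illustrative):

```c
/* Compress: pack elements of A whose mask bit is set into the
 * front of R; the population count of M gives the packed length. */
int compress(const double *A, const int *M, double *R, int n) {
    int len = 0;
    for (int i = 0; i < n; i++)
        if (M[i]) R[len++] = A[i];
    return len;                  /* packed vector length */
}

/* Expand: inverse operation -- scatter the packed values back to
 * the masked positions, leaving unmasked positions (B) untouched. */
void expand(const double *R, const int *M, double *B, int n) {
    int len = 0;
    for (int i = 0; i < n; i++)
        if (M[i]) B[i] = R[len++];
}
```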
Memory Banks
• The start‐up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start‐up on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
    1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
    2. Support (multiple) non‐sequential loads or stores
    3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. One SRAM access takes 15 / 2.167 ≈ 6.92, i.e. 7 processor cycles
  3. So 192 × 7 = 1344 memory banks are required for full memory bandwidth!
CA-Lec8 cwliutwinseenctuedutw 37
The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth)
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest: always possible to optimize this
  – Non‐unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather‐scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements live in different DRAM banks
  – Startup time for a vector operation is the latency of a single read
• What about non‐unit stride?
  – The layout above is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4, …
[Figure: a vector processor fed by 8 unpipelined DRAM banks, selected by Addr Mod 8 = 0 … 7]
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize the multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row‐major order, but elements of D in column‐major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non‐Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non‐unit stride for D (see the sketch below)
  – To access non‐sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general‐purpose register (dynamic)
• Caches inherently deal with unit‐stride data
  – Blocking techniques help with non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
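In C terms the stride falls out of the row-major layout. For the matrix-multiply example above (100×100 doubles), gathering one column of D is a stride-100 access; a minimal sketch:

```c
/* Loading one column of D (row-major in C) into a dense vector
 * register touches addresses 100 doubles (800 bytes) apart --
 * what LVWS with a stride of 100 expresses in one instruction. */
void load_column(double D[100][100], double vreg[100], int j) {
    for (int k = 0; k < 100; k++)
        vreg[k] = D[k][j];       /* stride = 100 double words */
}
```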
How to get full bandwidth for unit stride
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (one can read the literature on how to build a prime number of banks efficiently)
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, the loop below conflicts on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – but modulo & divide per memory access are costly with a prime number of banks
  – address within bank = address mod number of words in bank
  – bank number is easy if there are 2^N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
Problems of Stride Addressing
• Once we introduce non‐unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts!
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64‐element vector load with a stride of 1 and one with a stride of 32?
• Solution (see the C model below):
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 clock cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6‐cycle bank busy time
     – the load will take 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 clock cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
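A small C model of the calculation above. This is an illustrative simplification (one request issued per cycle, a bank revisited sooner than its busy time stalls the stream); it reproduces both slide numbers:

```c
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Estimated cycles for an n-element vector load from `banks`
 * interleaved banks with the given busy time and initial latency. */
int vector_load_cycles(int n, int stride, int banks, int busy, int latency) {
    /* elements issued between two hits to the same bank */
    int revisit = banks / gcd(stride, banks);
    if (revisit >= busy)
        return latency + n;                        /* no stalls: 12+64 = 76 */
    /* steady state: each bank serves one access per `busy` cycles */
    return latency + 1 + (busy * (n - 1)) / revisit; /* 12+1+6*63 = 391 */
}

/* vector_load_cycles(64,  1, 8, 6, 12) == 76
 * vector_load_cycles(64, 32, 8, 6, 12) == 391 */
```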
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather‐scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered)
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV       Vk, Rk         ; load K
LVI      Va, (Ra+Vk)    ; load A[K[]]
LV       Vm, Rm         ; load M
LVI      Vc, (Rc+Vm)    ; load C[M[]]
ADDVV.D  Va, Va, Vc     ; add them
SVI      (Ra+Vk), Va    ; store A[K[]]
• A and C must have the same number of non‐zero elements (the sizes of K and M)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real‐time model than out‐of‐order machines
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y‐MP benchmarks (table of per-program vectorization levels not reproduced here)
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64‐bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128‐bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit‐stride loads must be aligned to a 64/128‐bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
  – similar to Intel 860, Mot. 88110, HP PA‐71000LC, UltraSPARC
• 3 data types: 8 8‐bit, 4 16‐bit, 2 32‐bit, packed in 64 bits
  – reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8‐bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., …
  – used in drivers or added to library routines; no compiler support
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (clamp to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply‐Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets a field to 0s (false) or 1s (true); removes branches
• Pack/Unpack (see the saturating‐add sketch below)
  – Convert 32b <–> 16b, 16b <–> 8b
  – Pack saturates (clamps to max) if the number is too large
CA-Lec8 cwliutwinseenctuedutw 52
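As an illustration of the packed saturating arithmetic, a plain-C model of an unsigned 8-bit saturating add over a 64-bit word (the behavior of MMX's PADDUSB; this is a sketch, not the actual intrinsic):

```c
#include <stdint.h>

/* Packed unsigned-saturate byte add: eight 8-bit lanes in a 64-bit
 * word; lanes that overflow clamp to 255 instead of wrapping. */
uint64_t padd_usb(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned s = ((a >> (8 * lane)) & 0xFF)
                   + ((b >> (8 * lane)) & 0xFF);
        if (s > 255) s = 255;                /* saturate */
        r |= (uint64_t)s << (8 * lane);
    }
    return r;
}
```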
SIMD Implementations IA32AMD64
• Intel MMX (1996)
  – Repurposed the 64‐bit floating‐point registers
  – Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128‐bit registers
  – Eight 16‐bit integer ops; four 32‐bit integer/fp ops or two 64‐bit integer/fp ops
  – Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (2010)
  – 256‐bit registers
  – Four 64‐bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996‐1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than a full vector architecture
• Data transfers are separate, aligned chunks in memory
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DAXPY
        L.D     F0,a        ; load scalar a
        MOV     F1,F0       ; copy a into F1 for SIMD MUL
        MOV     F2,F0       ; copy a into F2 for SIMD MUL
        MOV     F3,F0       ; copy a into F3 for SIMD MUL
        DADDIU  R4,Rx,#512  ; last address to load
Loop:   L.4D    F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
        MUL.4D  F4,F4,F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
        L.4D    F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
        ADD.4D  F8,F8,F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
        S.4D    0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
        DADDIU  Rx,Rx,#32   ; increment index to X
        DADDIU  Ry,Ry,#32   ; increment index to Y
        DSUBU   R20,R4,Rx   ; compute bound
        BNEZ    R20,Loop    ; check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128‐256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always needs to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea:
  – Plot peak floating‐point throughput as a function of arithmetic intensity
  – Ties together floating‐point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating‐Point Performance) — a one‐line C model follows below
CA-Lec8 cwliutwinseenctuedutw 59
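A one-line C model of the roofline bound (the peak numbers are machine parameters; the function name is illustrative):

```c
#include <math.h>

/* Roofline: attainable GFLOPs/sec is capped either by the memory
 * system (bandwidth x arithmetic intensity) or by peak FP throughput. */
double attainable_gflops(double peak_gflops,
                         double peak_membw_gbs,
                         double arith_intensity /* FLOPs per byte */) {
    return fmin(peak_membw_gbs * arith_intensity, peak_gflops);
}
```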
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high‐end computers (e.g., SGI)
  – Then 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C‐like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread" (SIMT)
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++,
    • to focus on the important issues of parallelism—how to craft efficient parallel algorithms—rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data‐level parallel problems
  – Scatter‐gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor, no scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – A block is analogous to a strip‐mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32‐bit elements
    • 32 vector registers of 32 64‐bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU
• Global memory: 1280MB
• L2 cache: 640KB
• Per SM (SM 0 … SM 14): 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache, 32 cores
• Up to 1536 threads/SM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• The 32 threads within a block work collectively
  – Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general‐purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C (see the kernel sketch below).
• Sub‐matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
CA-Lec8 cwliutwinseenctuedutw 72
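A minimal CUDA sketch of the one-thread-per-element scheme described above. The kernel and launch names are illustrative; the shared-memory tiling mentioned on the slide is omitted for brevity:

```c
__global__ void matmul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)          /* one C element per thread */
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

/* Launch: 4096x4096 elements, 16x16 threads per block -> 256x256 blocks:
 *   dim3 threads(16, 16);
 *   dim3 grid(4096 / 16, 4096 / 16);
 *   matmul<<<grid, threads>>>(dA, dB, dC, 4096);
 */
```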
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to the GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: number of threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (the kernel must be __global__ to be launched)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – …and when paths converge
    • Act as barriers
    • Pop the stack
• Per‐thread‐lane 1‐bit predicate register, specified by the programmer
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
    @!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2  ; Difference in RD0
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
    @P1, bra ENDIF1, *Comp           ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off‐chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – The host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop‐Level Parallelism
• Loop‐carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop‐level parallelism requires no loop‐carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 uses a value computed by S1, and S2 a value computed by S2, in the previous iteration
  – Loop‐carried dependences
• S2 uses the value computed by S1 in the same iteration
  – Not a loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
• Intra‐loop dependence is not loop‐carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra‐loop dependence
• Two types of S1‐S2 intra‐loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop‐carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction (see the sketch below)
  – The two references are the same: there is no intervening memory access to the same location
• A more complex analysis, i.e. loop‐carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
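In effect the compiler keeps the computed value in a register; a sketch of the optimized form of the loop above (the temporary t stands in for that register):

```c
for (int i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];   /* A[i]'s value held in a register */
    A[i] = t;
    D[i] = t * E[i];          /* no second load of A[i] */
}
```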
Recurrence
• Recurrence is a special form of loop‐carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• Goal: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one‐dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi‐dimensional array is affine if the index in each dimension is affine
  – Non‐affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependencies: Example
• Assume:
  – we store to an array element with index a × i + b and load from the same array with index c × i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k, both within the limits of the loop: m ≤ j ≤ n, m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop‐carried dependence exists, then GCD(c,a) must divide (d−b)
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector‐Mask Control
CA-Lec8 cwliutwinseenctuedutw 32
C[4]
C[5]
C[1]
Write data port
A[7] B[7]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
Density-Time Implementationndash scan mask vector and only execute
elements with non-zero masks
C[1]
C[2]
C[0]
A[3] B[3]A[4] B[4]A[5] B[5]A[6] B[6]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0
M[1]=1
M[0]=0
Write data portWrite Disable
A[7] B[7]M[7]=1
Simple Implementationndash execute all N operations turn off result
writeback according to mask
Vector Mask Register (VMR)
bull A Boolean vector to control the execution of a vector instruction
bull VMR is part of the architectural statebull For vector processor it relies on compilers to manipulate VMR explicitly
bull For GPU it gets the same effect using hardwarendash Invisible to SW
bull Both GPU and vector processor spend time on masking
CA-Lec8 cwliutwinseenctuedutw 33
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding Dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1 -> S3 and S1 -> S4 (both on Y[i]), but not loop‐carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After (the first Y renamed to T, the stored X renamed to X1):
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop‐carried dependence
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
• The first loop is called scalar expansion (scalar ===> vector); it is fully parallel
• The second loop is called a reduction: it sums up the elements of the vector and is not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors (see the sketch below):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
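A sketch of the complete two-level scheme in C (array and function names are illustrative): each of the ten values of p sums its own 1000-element slice, which can proceed in parallel, and a short serial pass then combines the ten partial sums.

/* Two-level reduction: 10 slices of 1000 elements each. */
double reduce_10000(const double sum[10000]) {
    double finalsum[10] = {0.0};
    for (int p = 0; p < 10; p++)        /* each p could run on its own processor */
        for (int i = 999; i >= 0; i--)
            finalsum[p] += sum[i + 1000 * p];
    double total = 0.0;
    for (int p = 0; p < 10; p++)        /* final combine: only 10 additions */
        total += finalsum[p];
    return total;
}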
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse‐grained vs. fine‐grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine‐grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then the hardware is simpler, more energy efficient, and a better real‐time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector instruction
• The VMR is part of the architectural state
• A vector processor relies on compilers to manipulate the VMR explicitly
• A GPU gets the same effect using hardware
  – Invisible to SW
• Both GPUs and vector processors spend time on masking
Vector Mask Registers: Example
• Programs that contain IF statements in loops cannot otherwise be run in a vector processor
• Consider:
for (i = 0; i < 64; i=i+1)
    if (X[i] != 0)
        X[i] = X[i] - Y[i];
• Use the vector mask register to "disable" elements (a C model of these semantics follows):
LV       V1,Rx      ; load vector X into V1
LV       V2,Ry      ; load vector Y
L.D      F0,#0      ; load FP zero into F0
SNEVS.D  V1,F0      ; sets VM(i) to 1 if V1(i) != F0
SUBVV.D  V1,V1,V2   ; subtract under vector mask
SV       Rx,V1      ; store the result in X
• The GFLOPS rate decreases, since masked‐off elements still consume execution time
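A minimal C sketch of what the masked sequence computes (the array VM stands in for the vector mask register; all names are illustrative):

/* Masked subtract: X[i] -= Y[i] only where X[i] != 0, for n <= 64 elements. */
void masked_sub(double *X, const double *Y, int n) {
    unsigned char VM[64];
    for (int i = 0; i < n; i++)
        VM[i] = (X[i] != 0.0);   /* SNEVS.D: set mask bits where X[i] != 0 */
    for (int i = 0; i < n; i++)
        if (VM[i])               /* SUBVV.D writes only where the mask is 1 */
            X[i] = X[i] - Y[i];
}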
Compress/Expand Operations
• Compress packs non‐masked elements from one vector register contiguously at the start of the destination vector register
  – The population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M = (0,1,0,0,1,1,0,1) over elements 0..7, compress packs A[1], A[4], A[5], A[7] into the first four elements of the destination; expand scatters a packed vector back to positions 1, 4, 5, 7, leaving B's values in the masked‐off positions]
• Used for density‐time conditionals and also for general selection operations
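A C sketch of both operations (names are illustrative); the loops below make the packing order explicit:

/* Compress: pack the masked-in elements of src contiguously into dst.
   Returns the packed length (= population count of the mask). */
int compress(const double *src, const unsigned char *mask, double *dst, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;
}

/* Expand: inverse operation; scatter packed src back to the masked positions. */
void expand(const double *src, const unsigned char *mask, double *dst, int n) {
    int k = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
}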
Memory Banks
• The start‐up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start‐ups on load/store units are higher than those for arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores
  – Spread accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non‐sequential loads or stores
  3. Support multiple processors sharing the same memory system, so each processor will be generating its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. An SRAM access takes 15 / 2.167 ≈ 7 processor cycles
  3. It requires 192 × 7 = 1344 memory banks to sustain full memory bandwidth
• The Cray T932 actually has 1024 memory banks (it cannot sustain full bandwidth)
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this
  – Non‐unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather‐scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements sit in different DRAMs
  – Startup time for a vector operation is the latency of a single read
• What about non‐unit stride?
  – The above works well for strides that are relatively prime to 8
  – Bad for strides of 2, 4
[Figure: a vector processor connected to 8 unpipelined DRAM banks; bank i holds the words whose address mod 8 = i]
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize the multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row‐major order, but elements of D in column‐major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non‐Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example (address arithmetic sketched below):
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non‐unit stride for D
  – To access non‐sequential memory locations and to reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
    • The vector stride, like the vector starting address, can be put in a general‐purpose register (dynamic)
• Caches inherently deal with unit‐stride data
  – Blocking techniques help with non‐unit‐stride data
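A small C sketch of the address arithmetic behind these strides (assuming the 100×100 matrices are stored row-major as doubles; the helper names are illustrative):

/* Row-major 100x100 matrix: element (i,j) lives at base[i*100 + j].      */
/* A row of B is unit stride; a column of D has a stride of 100 elements. */
void load_with_stride(const double *base, int stride, double *vreg, int vl) {
    for (int k = 0; k < vl; k++)
        vreg[k] = base[k * stride];   /* what LVWS does, one element per lane */
}

/* Gather row i of B (stride 1) and column j of D (stride 100). */
void load_row_and_col(const double *B, const double *D, int i, int j,
                      double *brow, double *dcol) {
    load_with_stride(B + i * 100, 1,   brow, 100);
    load_with_stride(D + j,       100, dcol, 100);
}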
How to Get Full Bandwidth for Unit Stride?
• The memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (One can read the paper on how to implement a prime number of banks efficiently)
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
• SW: loop interchange, or declaring the array not a power of 2 wide ("array padding")
• HW: prime number of banks (a C sketch of the mappings follows)
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank
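A sketch of the two mappings in C (the bank counts below are illustrative): with a prime number of banks and 2^N words per bank, both the mod and the "divide" reduce to cheap operations.

#define NBANKS 127           /* prime number of banks (illustrative)         */
#define NWORDS (1 << 10)     /* 2^10 words per bank: mod is just bit masking */

int bank_number(unsigned addr)  { return addr % NBANKS; }
int addr_in_bank(unsigned addr) { return addr & (NWORDS - 1); } /* addr mod NWORDS */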
Problems of Stride Addressing
• Once we introduce non‐unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts!
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – # banks / LCM(stride, # banks) < bank busy time
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for initialization
• What is the difference between a 64‐element vector load with a stride of 1 and of 32? (A small simulation follows.)
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
  2. Since 32 is a multiple of 8, this is the worst possible stride
    – Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6‐cycle bank busy time
    – The total time will be 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element
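A tiny C simulation of this timing model under the stated assumptions (8 banks, 6-cycle bank busy time, 12-cycle startup latency; the model and names are illustrative). It reproduces 76 cycles for stride 1 and 391 cycles for stride 32.

/* Cycles to deliver a 64-element vector load with the given stride. */
long vector_load_cycles(int stride) {
    enum { NBANKS = 8, BUSY = 6, LATENCY = 12 };
    long ready[NBANKS] = {0};    /* cycle at which each bank becomes free */
    long t = LATENCY;            /* first word arrives after the startup  */
    for (int i = 0; i < 64; i++) {
        int b = (i * stride) % NBANKS;
        if (ready[b] > t) t = ready[b];  /* stall until the bank is free    */
        ready[b] = t + BUSY;             /* bank stays busy for 6 cycles    */
        t = t + 1;                       /* one element delivered per cycle */
    }
    return t;  /* vector_load_cycles(1) == 76, vector_load_cycles(32) == 391 */
}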
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather‐scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered)
Scatter‐Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors (an element-wise C model follows):
LV       Vk,Rk        ; load K
LVI      Va,(Ra+Vk)   ; load A[K[]]
LV       Vm,Rm        ; load M
LVI      Vc,(Rc+Vm)   ; load C[M[]]
ADDVV.D  Va,Va,Vc     ; add them
SVI      (Ra+Vk),Va   ; store A[K[]]
• A and C must have the same number of non‐zero elements (the sizes of K and M)
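Element by element, the indexed loads and stores behave like the C sketch below (illustrative names; LVI corresponds to gather, SVI to scatter):

/* Gather (LVI): vreg[i] = base[idx[i]] */
void gather(const double *base, const int *idx, double *vreg, int n) {
    for (int i = 0; i < n; i++) vreg[i] = base[idx[i]];
}

/* Scatter (SVI): base[idx[i]] = vreg[i] */
void scatter(double *base, const int *idx, const double *vreg, int n) {
    for (int i = 0; i < n; i++) base[idx[i]] = vreg[i];
}

/* The whole sparse sum the slide vectorizes: A[K[i]] += C[M[i]] */
void sparse_sum(double *A, const double *C, const int *K, const int *M, int n) {
    for (int i = 0; i < n; i++) A[K[i]] += C[M[i]];
}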
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then the hardware is simpler, more energy efficient, and a better real‐time model than out‐of‐order machines
• More lanes, slower clock rate!
  – Scalable if elements are independent
  – If there is a dependency:
    • One stall per vector instruction, rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y‐MP benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64‐bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128‐bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit‐stride loads must be aligned to a 64/128‐bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st extension since the 386)
  – similar to Intel 860, Mot. 88110, HP PA‐7100LC, UltraSPARC
• 3 data types: 8 8‐bit, 4 16‐bit, or 2 32‐bit elements in 64 bits
  – reuse the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8‐bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) on overflow (modeled in C below)
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply‐Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets each field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <–> 16b, 16b <–> 8b
  – Pack saturates (sets to max) if the number is too large
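Saturation is the behavior worth pinning down. A C model of one unsigned 8-bit lane (MMX applies this to all 8 lanes of a 64-bit register at once; names are illustrative):

#include <stdint.h>

/* Unsigned saturating 8-bit add: clamp to 255 instead of wrapping around. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    uint16_t s = (uint16_t)a + (uint16_t)b;
    return (s > 255) ? 255 : (uint8_t)s;
}

/* One 64-bit MMX register = 8 such lanes, all operated on in parallel. */
void paddusb_model(const uint8_t x[8], const uint8_t y[8], uint8_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = add_sat_u8(x[i], y[i]);
}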
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64‐bit floating point registers
  – Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128‐bit registers
  – Eight 16‐bit integer ops; four 32‐bit integer/fp ops or two 64‐bit integer/fp ops
  – Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (2010)
  – 256‐bit registers
  – Four 64‐bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996‐1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates the FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Data transfers are separate and aligned in memory
  – Vector: a single instruction can touch 64 memory locations, and a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY
L.D      F0,a         ; load scalar a
MOV      F1,F0        ; copy a into F1 for SIMD MUL
MOV      F2,F0        ; copy a into F2 for SIMD MUL
MOV      F3,F0        ; copy a into F3 for SIMD MUL
DADDIU   R4,Rx,#512   ; last address to load
Loop:
L.4D     F4,0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D   F4,F4,F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D     F8,0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D   F8,F8,F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D     0[Ry],F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU   Rx,Rx,#32    ; increment index to X
DADDIU   Ry,Ry,#32    ; increment index to Y
DSUBU    R20,R4,Rx    ; compute bound
BNEZ     R20,Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128‐256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating‐point throughput as a function of arithmetic intensity
  – Ties together floating‐point performance and memory performance for a target machine (see the sketch below)
• Arithmetic intensity:
  – Floating‐point operations per byte read
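The roofline is just the minimum of two ceilings, which fits in a few lines of C (the peak numbers below are purely illustrative, not measurements of any real machine):

/* Attainable GFLOP/s = min(peak FP throughput,
                            peak memory bandwidth × arithmetic intensity). */
double attainable_gflops(double ai /* FLOPs per byte */) {
    const double peak_gflops = 80.0;  /* illustrative peak FP throughput      */
    const double peak_bw_gbs = 20.0;  /* illustrative memory bandwidth (GB/s) */
    double memory_bound = peak_bw_gbs * ai;
    return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
}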
Examples
• Attainable GFLOPs/sec
[Figure: roofline plots for sample machines; performance is memory‐bound at low arithmetic intensity and compute‐bound at high arithmetic intensity]
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high‐end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • The CPU is the host, the GPU is the device
  – Develop a C‐like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – The programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – A minimalist set of abstractions for expressing parallelism
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data‐level parallel problems
  – Scatter‐gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor; scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192 (the sizing arithmetic is spelled out below)
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – A block is analogous to a strip‐mined vector loop with vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
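The sizing arithmetic from this slide as a C check (the numbers come from the slide; the rounding-up idiom also appears in the CUDA example later in this lecture):

#include <stdio.h>

int main(void) {
    int n = 8192;                 /* vector length                  */
    int threads_per_block = 512;
    int simd_width = 32;          /* elements per SIMD instruction  */
    int grid = (n + threads_per_block - 1) / threads_per_block;
    printf("grid size   = %d blocks\n", grid);                        /* 16 */
    printf("warps/block = %d\n", threads_per_block / simd_width);     /* 16 */
    return 0;
}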
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32‐bit elements
    • 32 vector registers of 32 64‐bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: GTX570 memory hierarchy: 1280MB global memory and 640KB L2 cache shared by the SMs (SM 0 ... SM 14); per SM: 8KB texture cache, 8KB constant cache, 48KB shared memory, 16KB L1 cache, 32768 registers, 32 cores, up to 1536 threads/SM]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
[Figure: matrix multiplication mapped onto the grid of thread blocks]
Programming the GPU
[Figure: CUDA source for the matrix multiplication kernel]
Matrix Multiplication
• For a 4096x4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general‐purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub‐matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared inside are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread identifier within a block
  – blockDim: threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (the kernel carries the __global__ qualifier so the host can launch it)
__global__
void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (the PTX for one DAXPY iteration):
shl.s32        R8, blockIdx, 9    ; Thread Block ID × Block size (512, or 2^9)
add.s32        R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64  RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64  RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64        RD0, RD0, RD4      ; Product in RD0 = RD0 × RD4 (scalar a)
add.f64        RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64  [Y+R8], RD0        ; Y[i] = sum (X[i]×a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – A branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e. which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ... and when paths converge
    • Act as barriers
    • Pop the stack
• A per‐thread‐lane 1‐bit predicate register, specified by the programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64  RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32   P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push             ; Push old mask, set new mask bits
                                   ; if P1 false, go to ELSE1
ld.global.f64  RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64        RD0, RD0, RD2       ; Difference in RD0
st.global.f64  [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp             ; complement mask bits
                                   ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]  ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop   ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off‐chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
  – The host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor]
Compiler Technology for Loop‐Level Parallelism
• Loop‐carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop‐level parallelism has no loop‐carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop‐carried dependence
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed in a previous iteration (A[i] by S1, B[i] by S2)
  – Loop‐carried dependence
• S2 uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop‐carried dependence
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Mask Registers Examplebull Programs that contain IF statements in loops cannot be run in vector
processorbull Consider
for (i = 0 i lt 64 i=i+1)if (X[i] = 0)
X[i] = X[i] ndash Y[i]bull Use vector mask register to ldquodisablerdquo elements
LV V1Rx load vector X into V1LV V2Ry load vector YLD F00 load FP zero into F0SNEVSD V1F0 sets VM(i) to 1 if V1(i)=F0SUBVVD V1V1V2 subtract under vector maskSV RxV1 store the result in X
bull GFLOPS rate decreases
CA-Lec8 cwliutwinseenctuedutw 34
CompressExpand Operationsbull Compress packs non‐masked elements from one vector register
contiguously at start of destination vector registerndash population count of mask vector gives packed vector length
bull Expand performs inverse operation
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
A[3]A[4]A[5]A[6]A[7]
A[0]A[1]A[2]
M[3]=0M[4]=1M[5]=1M[6]=0
M[2]=0M[1]=1M[0]=0
M[7]=1
B[3]A[4]A[5]B[6]A[7]
B[0]A[1]B[2]
Expand
A[7]
A[1]A[4]A[5]
Compress
A[7]
A[1]A[4]A[5]
Used for density-time conditionals and also for general selection operations CA-Lec8 cwliutwinseenctuedutw 35
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependences: Example
• Assume:
  – a load of an array element with index c × i + d, and a store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. there are two iteration indices j and k, both within the limits of the loop: m ≤ j ≤ n and m ≤ k ≤ n
  2. the store and the load touch the same element: a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – A cheap, conservative screen: the GCD (greatest common divisor) test
• If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2; |d − b| = 3
  3. Since 2 does not divide 3, no dependence is possible
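To make the test concrete, here is a minimal C sketch of the GCD screen (the names gcd and may_depend are illustrative, and nonzero coefficients a and c are assumed); it reproduces the example above:

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm. */
static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* GCD screen for a store to a*i+b and a load from c*i+d.
   Returns 0 only when GCD(c,a) does not divide (d-b), i.e., when no
   dependence is possible; 1 means a dependence MAY exist. */
static int may_depend(int a, int b, int c, int d)
{
    return abs(d - b) % gcd(abs(c), abs(a)) == 0;
}

int main(void)
{
    /* the slide's example: X[2*i+3] = X[2*i] * 5.0 */
    printf("%s\n", may_depend(2, 3, 2, 0) ? "maybe" : "no dependence possible");
    return 0;
}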
Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but it is not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test reports a possible dependence but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences: Example 2
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1->S3 and S1->S4, through Y[i]; not loop-carried
• Antidependences: S1->S2 on X[i], and S3->S4 on Y[i]
• Output dependence: S1->S4 on Y[i]
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After (the fresh arrays T and X1 remove the anti- and output dependences):
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it carries a dependence on sum
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
• The first loop is called scalar expansion (scalar ==> vector) and is fully parallel
• The second loop is called a reduction: it sums up the elements of the vector and is not parallel as written
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1000 elements on each of ten processors (see the sketch below):
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
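A C sketch of this two-level scheme; the array sizes and the helper names partial_sum and combine are illustrative, with each call partial_sum(p) imagined to run on its own processor:

double sum[10000];     /* the 10,000 products from scalar expansion */
double finalsum[10];   /* one partial sum per processor             */
double total;

void partial_sum(int p)               /* run concurrently for p = 0..9 */
{
    finalsum[p] = 0.0;
    for (int i = 999; i >= 0; i = i-1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];
}

void combine(void)                    /* after all partial sums finish */
{
    total = 0.0;
    for (int p = 0; p < 10; p++)
        total = total + finalsum[p];
}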
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to higher performance
• Coarse-grained vs. fine-grained multithreading
  – Switch threads only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Compress/Expand Operations
• Compress packs the non-masked elements from one vector register contiguously at the start of the destination vector register
  – the population count of the mask vector gives the packed vector length
• Expand performs the inverse operation
[Figure: with mask M = (0,1,0,0,1,1,0,1), compress packs A[1], A[4], A[5], A[7] to the start of the destination; expand scatters those four values back into the positions where M[i]=1, leaving B[0], B[2], B[3], B[6] unchanged.]
• Used for density-time conditionals and also for general selection operations
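In scalar C, the semantics amount to the following sketch (a model only, with illustrative names, not a real vector ISA):

/* Compress: pack the masked elements of a[] to the front of dst[];
   returns the packed length (the population count of the mask). */
int compress(const double *a, const int *mask, double *dst, int n)
{
    int len = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) dst[len++] = a[i];
    return len;
}

/* Expand: the inverse, scattering packed values src[] back into the
   masked positions of b[]; unmasked b[i] keep their old values. */
void expand(const double *src, const int *mask, double *b, int n)
{
    int j = 0;
    for (int i = 0; i < n; i++)
        if (mask[i]) b[i] = src[j++];
}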
Memory Banks
• The start-up time for a load/store vector unit is the time to get the first word from memory into a register. How about the rest of the vector?
  – Memory stalls can reduce effective throughput for the rest of the vector
• Start-up penalties on load/store units are higher than those on arithmetic units
• The memory system must be designed to support high bandwidth for vector loads and stores by spreading accesses across multiple banks:
  1. Support multiple loads or stores per clock; be able to control the addresses to the banks independently
  2. Support (multiple) non-sequential loads or stores
  3. Support multiple processors sharing the same memory system, since each processor generates its own independent stream of addresses
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle
• Processor cycle time is 2.167 ns
• SRAM cycle time is 15 ns
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192
  2. One SRAM access takes 15 / 2.167 ≈ 6.92, i.e., 7 processor cycles
  3. So 192 × 7 = 1344 memory banks are required for full memory bandwidth
• The Cray T932 actually has only 1024 memory banks, so it cannot sustain full bandwidth
Memory Addressing
• Load/store operations move groups of data between registers and memory
• Three types of addressing:
  – Unit stride
    • Contiguous block of information in memory
    • Fastest; always possible to optimize this case
  – Non-unit (constant) stride
    • Harder to optimize the memory system for all possible strides
    • A prime number of data banks makes it easier to support different strides at full bandwidth
  – Indexed (gather-scatter)
    • Vector equivalent of register indirect
    • Good for sparse arrays of data
    • Increases the number of programs that vectorize
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements sit in different DRAM banks
  – Start-up time for a vector operation is the latency of a single read
• What about non-unit stride?
  – The layout below is good for strides that are relatively prime to 8
  – Bad for strides of 2, 4, ...
[Figure: a vector processor fed by eight unpipelined DRAM banks; bank i holds the words whose address mod 8 = i.]
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
    for (j = 0; j < 100; j=j+1) {
        A[i][j] = 0.0;
        for (k = 0; k < 100; k=k+1)
            A[i][j] = A[i][j] + B[i][k] * D[k][j];
    }
• Must vectorize the multiplications of rows of B with columns of D
  – Need to access adjacent elements of B and adjacent elements of D
  – Elements of B are accessed in row-major order, but elements of D in column-major order
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help non-unit-stride data
How to Get Full Bandwidth for Unit Stride
• The memory system must sustain (# lanes × word) per clock
• # memory banks > memory latency (in cycles), to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – There are published techniques for implementing a prime number of banks efficiently
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, this column walk hits the same bank on every word access
• SW fix: loop interchange, or declaring the array so a row is not a power of 2 ("array padding")
• HW fix: a prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – but that needs a modulo and a divide per memory access with a prime number of banks
  – alternatively, address within bank = address mod number of words in bank
  – bank number is easy to compute if there are 2^N words per bank
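A sketch of the two software fixes, assuming the 128 word-interleaved banks above (the pad to 513 words per row is one illustrative choice; any row length that is not a multiple of the bank count works):

int xp[256][513];  /* padded: 513 mod 128 = 1, so a column walk visits
                      successive banks instead of hammering one bank */
int y[256][512];

void fix_padding(void)
{
    for (int j = 0; j < 512; j = j+1)      /* same column-major walk   */
        for (int i = 0; i < 256; i = i+1)  /* as the slide, now spread */
            xp[i][j] = 2 * xp[i][j];       /* across all 128 banks     */
}

void fix_interchange(void)
{
    for (int i = 0; i < 256; i = i+1)      /* interchanged loops: the   */
        for (int j = 0; j < 512; j = j+1)  /* inner walk is unit stride */
            y[i][j] = 2 * y[i][j];
}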
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflicts
• A conflict is resolved by stalling the later request
• A bank conflict (stall) occurs when the same bank is hit again before its bank busy time has elapsed, i.e., when the revisit interval
  #banks / GCD(stride, #banks)  (= LCM(stride, #banks) / stride)
  is less than the bank busy time
Example
• 8 memory banks; bank busy time 6 cycles; total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) collides with the previous access and has to wait out the 6-cycle bank busy time
     – the total time is 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
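The same arithmetic can be checked with a toy C model of the slide's assumptions (one element delivered per cycle; a bank is busy for 6 cycles after each access; all names here are illustrative):

#include <stdio.h>

/* Toy model: 'banks' interleaved banks, each busy for 'busy' cycles
   after an access; the first word arrives after 'latency' cycles.
   Returns total cycles to load n elements with the given stride. */
long strided_load_cycles(int n, int stride, int banks, int busy, int latency)
{
    long t = latency;              /* time the first word arrives        */
    long free_at[256] = {0};       /* cycle when each bank is free again */
    for (int i = 0; i < n; i++) {
        int b = (long)i * stride % banks;   /* bank hit by element i     */
        if (t < free_at[b]) t = free_at[b]; /* stall until bank is free  */
        free_at[b] = t + busy;
        t = t + 1;                 /* one element delivered per cycle    */
    }
    return t;
}

int main(void)
{
    printf("stride  1: %ld cycles\n", strided_load_cycles(64, 1, 8, 6, 12));
    printf("stride 32: %ld cycles\n", strided_load_cycles(64, 32, 8, 6, 12));
    return 0;
}

It prints 76 and 391 cycles, matching the solution above.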
Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gathered/scattered)
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]  (gather)
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]  (gather)
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]] (scatter)
• A and C must have the same number of nonzero elements (the sizes of K and M)
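In scalar C, the sequence corresponds to this sketch (a model of the semantics only; the name sparse_sum is illustrative):

/* Scalar model of the gather/add/scatter sequence above. */
void sparse_sum(double *A, const double *C,
                const int *K, const int *M, int n)
{
    double Va[n], Vc[n];                               /* model vector registers */
    for (int i = 0; i < n; i++) Va[i] = A[K[i]];       /* LVI: gather A[K[]]     */
    for (int i = 0; i < n; i++) Vc[i] = C[M[i]];       /* LVI: gather C[M[]]     */
    for (int i = 0; i < n; i++) Va[i] = Va[i] + Vc[i]; /* ADDVV.D                */
    for (int i = 0; i < n; i++) A[K[i]] = Va[i];       /* SVI: scatter A[K[]]    */
}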
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
• More lanes allow a slower clock rate for the same throughput
  – Scalable if elements are independent
  – If there is dependency: one stall per vector instruction rather than one stall per vector element
• The programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• The fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers about which loops vectorized and why others did not
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks [vectorization results not reproduced]
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (AltiVec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend is towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the 1st ISA extension since the 386)
  – similar to Intel 860, Motorola 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements packed in 64 bits
  – reuses the 8 FP registers (FP and MMX code cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claimed overall speedup: 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
  – used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (clamp to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets each field to all 0s (false) or all 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (clamps to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
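As a concrete taste of these extensions, a minimal sketch of DAXPY with SSE2 intrinsics, two doubles per 128-bit operation (it assumes n is even; unaligned loads and stores are used, so no special alignment is required):

#include <emmintrin.h>  /* SSE2 */

void daxpy_sse2(int n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);                 /* broadcast scalar a  */
    for (int i = 0; i < n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);        /* X[i], X[i+1]        */
        __m128d vy = _mm_loadu_pd(&y[i]);        /* Y[i], Y[i+1]        */
        vy = _mm_add_pd(_mm_mul_pd(va, vx), vy); /* a*X + Y             */
        _mm_storeu_pd(&y[i], vy);                /* store back into Y   */
    }
}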
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Cost little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than a full vector machine
• Data transfers are separate, aligned accesses in memory
  – Vector: a single instruction can make 64 memory accesses, so a page fault in the middle of the vector is likely
• Use much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DAXPY
      L.D     F0, a         ; load scalar a
      MOV     F1, F0        ; copy a into F1 for SIMD MUL
      MOV     F2, F0        ; copy a into F2 for SIMD MUL
      MOV     F3, F0        ; copy a into F3 for SIMD MUL
      DADDIU  R4, Rx, #512  ; last address to load
Loop: L.4D    F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D  F4, F4, F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
      L.4D    F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D  F8, F8, F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
      S.4D    0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU  Rx, Rx, #32   ; increment index to X
      DADDIU  Ry, Ry, #32   ; increment index to Y
      DSUBU   R20, R4, Rx   ; compute bound
      BNEZ    R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar-processor memory architecture
  – Only accesses contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write an entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity:
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak memory bandwidth × Arithmetic intensity, Peak floating-point performance)
• [Figure: roofline plots, not reproduced; the sloped left part of each roof is the memory-bandwidth limit, the flat right part the compute limit.]
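A minimal sketch of the model in C; the peak values (100 GFLOP/s, 25 GB/s) are made-up inputs for illustration, not measurements of any machine:

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak_gflops, peak_bw * AI),
   where AI is arithmetic intensity in FLOPs per byte read. */
double attainable(double peak_gflops, double peak_bw_gbs, double ai)
{
    double mem_bound = peak_bw_gbs * ai;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    for (double ai = 0.25; ai <= 16.0; ai *= 2)   /* sweep the roofline */
        printf("AI %5.2f -> %6.1f GFLOP/s\n", ai, attainable(100.0, 25.0, ai));
    return 0;
}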
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – Then 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – A SIMD instruction executes 32 elements at a time
  – Thus grid size = 8192 / 512 = 16 blocks
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – The thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• The thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• An NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory 1280MB; L2 cache 640KB. Each SM (SM 0 ... SM 14, 32 cores each) has 32768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache; up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
[Figure only: not reproduced.]
Programming the GPU
[Figure only: not reproduced.]
Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C requires calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel
• On a general-purpose processor, we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the kernel sketch below)
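A hedged sketch of such a kernel in CUDA C: it assumes square NxN row-major matrices with N a multiple of the tile size, and the kernel name matmul_tiled and the TILE value are illustrative, not from the lecture. Each thread computes one cell of C, and each block stages TILExTILE sub-matrices of A and B through shared memory, exactly the buffering described above.

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];  /* sub-matrix of A in shared memory */
    __shared__ float Bs[TILE][TILE];  /* sub-matrix of B in shared memory */

    int row = blockIdx.y * TILE + threadIdx.y;  /* this thread's C element */
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {        /* walk tiles along k */
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                        /* tile fully loaded before use */
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        /* done with tile before next load */
    }
    C[row * N + col] = sum;                     /* one thread writes one cell */
}

/* Launch (host side), one thread per cell of C:
   matmul_tiled<<<dim3(N/TILE, N/TILE), dim3(TILE, TILE)>>>(A, B, C, N); */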
Programming the GPU
• Keywords distinguish where a function executes:
  – __device__ or __global__ => GPU (device); variables declared in such functions are allocated to GPU memory
  – __host__ => system processor (host)
• Kernel call syntax: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier within the grid
  – threadIdx: thread identifier within its block
  – blockDim: number of threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (the kernel must be __global__ to be launched from the host)
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (one CUDA thread of DAXPY):
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512, i.e., 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – A branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on a divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

        ld.global.f64 RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32  P1, RD0, #0    ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push             ; Push old mask, set new mask bits
                                     ; if P1 false, go to ELSE1
        ld.global.f64 RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64       RD0, RD0, RD2  ; Difference in RD0
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
  @P1,  bra ENDIF1, *Comp            ; complement mask bits
                                     ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop     ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU memory
  – The host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and a unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Processor
[Figure: block diagram of a Fermi multithreaded SIMD processor, not reproduced.]
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Memory Banksbull The start‐up time for a loadstore vector unit is the time to get the first
word from memory into a register How about the rest of the vectorndash Memory stalls can reduce effective throughput for the rest of the vector
bull Penalties for start‐ups on loadstore units are higher than those for arithmetic unit
bull Memory system must be designed to support high bandwidth for vector loads and storesndash Spread accesses across multiple banks1 Support multiple loads or stores per clock Be able to control the addresses
to the banks independently2 Support (multiple) non‐sequential loads or stores3 Support multiple processors sharing the same memory system so each
processor will be generating its own independent stream of addresses
CA-Lec8 cwliutwinseenctuedutw 36
Example (Cray T90)bull 32 processors each generating 4 loads and 2 stores per
cyclebull Processor cycle time is 2167nsbull SRAM cycle time is 15nsbull How many memory banks needed
bull Solution1 The maximum number of memory references each cycle is
326=1922 SRAM takes 152167=6927 processor cycles3 It requires 1927=1344 memory banks at full memory
bandwidth
CA-Lec8 cwliutwinseenctuedutw 37
Cray T932 actually has 1024 memory banks (not sustain full bandwidth)
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding Dependencies: Example 2
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but they are not loop-carried.
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i]).
• Output dependence: S1 -> S4 (Y[i]).
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product:
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
• The first loop is called scalar expansion (scalar ===> vector); it is parallel.
• The second loop is called a reduction: it sums up the elements of the vector, and it is not parallel.
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures:
  – Similar to what can be done in a multiprocessor environment.
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9.
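A minimal C sketch of the complete two-level reduction the slide describes, under the assumptions that sum already holds the 10,000 products and that each value of p stands in for one processor; the final serial combining pass is added here for completeness and is not part of the slide:

double reduce_10000(const double sum[10000])
{
  double finalsum[10] = {0.0};
  double total = 0.0;

  /* Phase 1: ten independent partial sums; each p could run on
     its own processor, since no iteration shares finalsum[p]. */
  for (int p = 0; p < 10; p++)
    for (int i = 999; i >= 0; i--)
      finalsum[p] = finalsum[p] + sum[i + 1000 * p];

  /* Phase 2: combine the ten partial sums serially. */
  for (int p = 0; p < 10; p++)
    total = total + finalsum[p];
  return total;
}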
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse-grained vs. fine-grained multithreading:
  – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture:
  – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, the result is simpler hardware that is more energy efficient and has a better real-time model than OOO machines.
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per cycle.
• Processor cycle time is 2.167 ns.
• SRAM cycle time is 15 ns.
• How many memory banks are needed?
• Solution:
  1. The maximum number of memory references each cycle is 32 × 6 = 192.
  2. An SRAM access takes 15 / 2.167 ≈ 7 processor cycles.
  3. Sustaining full memory bandwidth therefore requires 192 × 7 = 1344 memory banks.
• Note: the Cray T932 actually has 1024 memory banks, so it cannot sustain full bandwidth.
Memory Addressing
• Load/store operations move groups of data between registers and memory.
• Three types of addressing:
  – Unit stride:
    • Contiguous block of information in memory.
    • Fastest: always possible to optimize this.
  – Non-unit (constant) stride:
    • Harder to optimize the memory system for all possible strides.
    • A prime number of data banks makes it easier to support different strides at full bandwidth.
  – Indexed (gather-scatter):
    • Vector equivalent of register indirect.
    • Good for sparse arrays of data.
    • Increases the number of programs that vectorize.
Interleaved Memory Layout
• Great for unit stride:
  – Contiguous elements are in different DRAMs.
  – Startup time for a vector operation is the latency of a single read.
• What about non-unit stride?
  – The layout above is good for strides that are relatively prime to 8.
  – Bad for strides of 2, 4.
[Figure: a vector processor fed by 8 unpipelined DRAM banks, with addresses interleaved across the banks by (Addr mod 8) = 0, 1, ..., 7.]
Handling Multidimensional Arrays in Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
• Must vectorize the multiplications of rows of B with columns of D:
  – Need to access adjacent elements of B and adjacent elements of D.
  – Elements of B are accessed in row-major order, but elements of D in column-major order.
• Once a vector is loaded into a register, it acts as if it had logically adjacent elements.
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride.
• In the example above (illustrated in the sketch after this slide):
  – Matrix D has a stride of 100 double words.
  – Matrix B has a stride of 1 double word.
• Use non-unit stride for D:
  – To access non-sequential memory locations and to reshape them into a dense structure.
• The size of the matrix may not be known at compile time:
  – Use LVWS/SVWS: load/store vector with stride.
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic).
• Caches inherently deal with unit-stride data:
  – Blocking techniques help with non-unit-stride data.
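As a small illustration (an addition to the slides, with hypothetical helper names), here is how those strides look for row-major C arrays: walking a row of B touches consecutive doubles, while walking a column of D jumps a full 100-double row per step.

double B[100][100], D[100][100];

/* Unit stride: &B[i][k+1] - &B[i][k] = 1 double. */
double row_walk(int i) {
  double s = 0.0;
  for (int k = 0; k < 100; k++) s += B[i][k];
  return s;
}

/* Stride 100: &D[k+1][j] - &D[k][j] = 100 doubles. */
double col_walk(int j) {
  double s = 0.0;
  for (int k = 0; k < 100; k++) s += D[k][j];
  return s;
}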
How to get full bandwidth for unit stride
• The memory system must sustain (# lanes × word) per clock.
• # memory banks > memory latency, to avoid stalls.
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously),
  – Or wider DRAMs; only good for unit stride or large data types.
• More banks are also good for supporting more strides at full bandwidth:
  – One can read the literature on how to implement a prime number of banks efficiently.
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict.
• SW: loop interchange, or declaring the array not a power of 2 ("array padding").
• HW: a prime number of banks (see the mapping sketch below):
  – bank number = address mod number of banks
  – address within bank = address / number of banks
  – Modulo and divide per memory access, with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank
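The two address mappings above can be written out directly. This sketch is illustrative (word addresses assumed), not from the slides:

/* Interleaved scheme: consecutive words go to consecutive banks.
   With a prime number of banks this needs a real modulo and divide. */
unsigned bank_number(unsigned addr, unsigned nbanks)  { return addr % nbanks; }
unsigned addr_in_bank(unsigned addr, unsigned nbanks) { return addr / nbanks; }

/* With 2^N words per bank, the fields are just bit selections:
   bank = addr >> N;  offset = addr & ((1u << N) - 1); */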
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently:
  – Memory bank conflicts.
• Stall one of the requests to resolve a bank conflict.
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – Conflict when LCM(stride, #banks) / stride < bank busy time, i.e., #banks / GCD(stride, #banks) < bank busy time.
Example
• 8 memory banks; bank busy time of 6 cycles; total memory latency of 12 cycles for initialization.
• What is the difference between a 64-element vector load with a stride of 1 versus 32?
• Solution (the toy model after this slide reproduces both numbers):
  1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e., 1.2 cycles per element.
  2. Since 32 is a multiple of 8, it is the worst possible stride:
    – Every access to memory (after the first one) collides with the previous access and must wait for the 6-cycle bank busy time.
    – The total time is 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element.
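The arithmetic above generalizes into a toy timing model. The following sketch is an assumption-laden illustration added here (one element request per cycle, fixed startup latency, full bank-busy stall on every conflicting access), just enough to reproduce the 76- and 391-cycle results:

#include <stdio.h>

static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

static int load_cycles(int elems, int stride, int banks, int busy, int latency) {
  /* Distinct banks touched before the access pattern repeats. */
  int distinct = banks / gcd(stride, banks);
  if (distinct >= busy)                      /* no conflicts */
    return latency + elems;
  return latency + 1 + busy * (elems - 1);   /* every later access stalls */
}

int main(void) {
  printf("stride 1:  %d cycles\n", load_cycles(64, 1, 8, 6, 12));   /* 76  */
  printf("stride 32: %d cycles\n", load_cycles(64, 32, 8, 6, 12));  /* 391 */
  return 0;
}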
Handling Sparse Matrices in Vector Architectures
• Handling sparse matrices in vector mode is a necessity.
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors designating the nonzero elements of A and C.
• Sparse matrix elements are stored in a compact form and accessed indirectly.
• Gather-scatter operations are used:
  – Using LVI/SVI: load/store vector indexed (gather/scatter).
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors.
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of nonzero elements (the sizes of K and M).
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP:
  – If code is vectorizable, the result is simpler hardware that is more energy efficient and has a better real-time model than out-of-order machines.
• More lanes allow a slower clock rate:
  – Scalable if the elements are independent.
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element.
• The programmer is in charge of giving hints to the compiler.
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations.
• The fundamental design issue is memory bandwidth:
  – Especially with virtual address translation and caching.
Programming Vector Architectures
• Compilers can provide feedback to programmers.
• Programmers can provide hints to the compiler.
• Cray Y-MP benchmarks.
Multimedia Extensions
• Very short vectors added to existing ISAs.
• Usually 64-bit registers split into 2×32b, 4×16b, or 8×8b.
• Newer designs have 128-bit registers (AltiVec, SSE2).
• Limited instruction set:
  – No vector length control.
  – No load/store stride or scatter/gather.
  – Unit-stride loads must be aligned to a 64/128-bit boundary.
• Limited vector register length:
  – Requires superscalar dispatch to keep multiply/add/load units busy.
  – Loop unrolling to hide latencies increases register pressure.
• Trend is toward fuller vector support in microprocessors.
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (the first extension since the 386):
  – Similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC.
• 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements packed in 64 bits:
  – Reuses the 8 FP registers (FP and MMX cannot mix).
• Short vector: load, add, store 8 8-bit operands.
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.:
  – Used in drivers or added to library routines; no compiler support.
MMX Instructions
• Move: 32b, 64b.
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b:
  – Optionally signed/unsigned saturate (set to max) on overflow.
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b.
• Multiply, Multiply-Add in parallel: 4 16b.
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b:
  – Sets the field to 0s (false) or 1s (true); removes branches.
• Pack/Unpack:
  – Convert 32b <-> 16b, 16b <-> 8b.
  – Pack saturates (sets to max) if the number is too large.
SIMD Implementations: IA32/AMD64
• Intel MMX (1996):
  – Repurposes the 64-bit floating-point registers.
  – Eight 8-bit integer ops or four 16-bit integer ops.
• Streaming SIMD Extensions (SSE) (1999):
  – Separate 128-bit registers.
  – Eight 16-bit integer ops; four 32-bit integer/FP ops or two 64-bit integer/FP ops.
  – Single-precision floating-point arithmetic.
• SSE2 (2001), SSE3 (2004), SSE4 (2007):
  – Double-precision floating-point arithmetic.
• Advanced Vector Extensions (2010):
  – 256-bit registers.
  – Four 64-bit integer/FP ops.
  – Extensible to 512 and 1024 bits for future generations.
SIMD Implementations: IBM
• VMX (1996-1998):
  – 32 4b, 16 8b, 8 16b, or 4 32b integer ops, and 4 32b FP ops.
  – Data rearrangement.
• Cell SPE (PS3):
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops.
  – Unified vector/scalar execution with 128 registers.
• VMX 128 (Xbox 360):
  – Extension to 128 registers.
• VSX (2009):
  – 1 or 2 64b FPU, 4 32b FPU.
  – Integrates FPU and VMX into a unit with 64 registers.
• QPX (2010, Blue Gene):
  – Four 64b SP or DP FP.
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size.
• Costs little to add to the standard arithmetic unit.
• Easy to implement.
• Needs smaller memory bandwidth than vector.
• Data transfers are separate and aligned in memory:
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely.
• Uses much smaller register space.
• Fewer operands.
• No need for the sophisticated mechanisms of vector architectures.
Example SIMD Code
• Example: DAXPY
L.D     F0, a          ; load scalar a
MOV     F1, F0         ; copy a into F1 for SIMD MUL
MOV     F2, F0         ; copy a into F2 for SIMD MUL
MOV     F3, F0         ; copy a into F3 for SIMD MUL
DADDIU  R4, Rx, #512   ; last address to load
Loop:
L.4D    F4, 0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D  F4, F4, F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
L.4D    F8, 0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D  F8, F8, F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D    0[Ry], F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU  Rx, Rx, #32    ; increment index to X
DADDIU  Ry, Ry, #32    ; increment index to Y
DSUBU   R20, R4, Rx    ; compute bound
BNEZ    R20, Loop      ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture:
  – Only accesses contiguous data:
    • No efficient scatter/gather accesses.
    • Significant penalty for unaligned memory access.
    • May need to write the entire vector register.
• Limitations on data access patterns:
  – Limited by the cache line, up to 128-256b.
• Conditional execution:
  – Register renaming does not work well with masked execution.
  – Always need to write the whole register.
  – Difficult to know when to indicate exceptions.
• Register pressure:
  – Need to use multiple registers, rather than register depth, to hide latency.
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity.
  – Ties together floating-point performance and memory performance for a target machine.
• Arithmetic intensity:
  – Floating-point operations per byte read.
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance).
[Figure: roofline plots; the sloped segment is bound by memory bandwidth, the flat segment by peak floating-point performance.]
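The roofline min() can be written directly as a helper. This is a minimal sketch added here; the example numbers in the comment are illustrative assumptions, not measurements from the lecture:

/* Attainable GFLOP/s given peak memory bandwidth (GB/s), peak FP
   throughput (GFLOP/s), and a kernel's arithmetic intensity
   (FLOPs per byte). */
static double attainable_gflops(double peak_bw_gbs,
                                double peak_fp_gflops,
                                double intensity_flops_per_byte) {
  double memory_bound = peak_bw_gbs * intensity_flops_per_byte;
  return memory_bound < peak_fp_gflops ? memory_bound : peak_fp_gflops;
}
/* e.g. attainable_gflops(16.0, 64.0, 0.5) == 8.0: the kernel stays
   memory-bound until intensity reaches 64/16 = 4 FLOPs per byte. */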
History of GPUs
• Early video cards:
  – Frame buffer memory with address generation for video output.
• 3D graphics processing:
  – Originally on high-end computers (e.g., SGI).
  – 3D graphics cards for PCs and game consoles.
• Graphics Processing Units:
  – Processors oriented to 3D graphics tasks.
  – Vertex/pixel processing, shading, texture mapping, ray tracing.
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model:
    • The CPU is the host, the GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – The programming model is "Single Instruction, Multiple Thread".
Programming Model
• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++:
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language.
  – A minimalist set of abstractions for expressing parallelism:
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems.
  – Scatter-gather transfers from memory into local store.
  – Mask registers.
  – Large register files.
• Differences:
  – No scalar processor / scalar integration.
  – Uses multithreading to hide memory latency.
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor.
Programming the GPU
• CUDA programming model:
  – Single Instruction, Multiple Thread (SIMT):
    • A thread is associated with each data element.
    • Threads are organized into blocks.
    • Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS:
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example
• Multiply two vectors of length 8192:
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes:
    • 512 threads per block.
  – A SIMD instruction executes 32 elements at a time.
  – Thus, grid size = 16 blocks.
  – A block is analogous to a strip-mined vector loop with a vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.
Terminology
• Threads of SIMD instructions:
  – Each has its own PC.
  – The thread scheduler uses a scoreboard to dispatch.
  – No data dependencies between threads.
  – Keeps track of up to 48 threads of SIMD instructions:
    • Hides memory latency.
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes.
  – Wide and shallow compared to vector processors.
Example
• An NVIDIA GPU has 32,768 registers:
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements.
    • 32 vector registers of 32 64-bit elements.
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers.
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory: 1280MB; L2 cache: 640KB. Each of the 15 SMs (SM 0 ... SM 14) has 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB constant cache, 8KB texture cache, and supports up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively:
  – Memory access optimization, latency hiding.
Matrix Multiplication
[Figure: matrix multiplication example.]
Programming the GPU
[Figure: GPU program example.]
Matrix Multiplication
• For a 4096×4096 matrix multiplication:
  – Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the kernel sketch below).
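The description above maps naturally onto the classic tiled CUDA kernel. The following sketch is an illustration consistent with the slide's description, not code from the lecture; TILE, the kernel name, and the assumption that N is a multiple of TILE are all choices made here:

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
  __shared__ float As[TILE][TILE];   /* sub-matrix buffers in shared memory */
  __shared__ float Bs[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;   /* one thread per C element */
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < N / TILE; t++) {
    /* Cooperative load of one tile of A and one tile of B. */
    As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
    __syncthreads();

    for (int k = 0; k < TILE; k++)
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();
  }
  C[row * N + col] = acc;
}

/* Launch for a 4096x4096 product:
   dim3 block(TILE, TILE);               // 256 threads per block
   dim3 grid(4096 / TILE, 4096 / TILE);  // 256 x 256 blocks
   matmul_tiled<<<grid, block>>>(dA, dB, dC, 4096);                    */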
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory.
  – __host__ => system processor (host).
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list):
  – blockIdx: block identifier.
  – threadIdx: thread identifier within a block.
  – blockDim: number of threads per block.
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}
// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__device__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
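For completeness, here is a hedged sketch of the host-side glue the fragments above imply, using standard CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree) with error checking omitted; the function names are hypothetical. Note that a kernel launched with <<<...>>> must be declared __global__ in CUDA proper, even though the slide writes __device__:

#include <cuda_runtime.h>

__global__ void daxpy_kernel(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

void daxpy_on_gpu(int n, double a, double *x, double *y)  /* x, y on host */
{
  double *dx, *dy;
  cudaMalloc(&dx, n * sizeof(double));
  cudaMalloc(&dy, n * sizeof(double));
  cudaMemcpy(dx, x, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, y, n * sizeof(double), cudaMemcpyHostToDevice);

  int nblocks = (n + 255) / 256;            /* 256 threads per block */
  daxpy_kernel<<<nblocks, 256>>>(n, a, dx, dy);

  cudaMemcpy(y, dy, n * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(dx);
  cudaFree(dy);
}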
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)".
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – A branch synchronization stack:
    • Entries consist of masks for each SIMD lane.
    • I.e., which threads commit their results (all threads execute).
  – Instruction markers to manage when a branch diverges into multiple execution paths:
    • Push on a divergent branch.
  – ...and when paths converge:
    • Act as barriers.
    • Pop the stack.
• Per-thread-lane 1-bit predicate register, specified by the programmer.
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

        ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
        setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
  @!P1, bra ELSE1, *Push              ; Push old mask, set new mask bits
                                      ; if P1 false, go to ELSE1
        ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
        sub.f64        RD0, RD0, RD2  ; Difference in RD0
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
  @P1,  bra ENDIF1, *Comp             ; complement mask bits
                                      ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64  RD0, [Z+R8]    ; RD0 = Z[i]
        st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop      ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM:
  – "Private memory".
  – Contains the stack frame, spilling registers, and private variables.
• Each multithreaded SIMD processor also has local memory:
  – Shared by the SIMD lanes / threads within a block.
• Memory shared by all SIMD processors is GPU memory:
  – The host can read and write GPU memory.
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units.
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units.
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles.
• Fast double precision.
• Caches for GPU memory.
• 64-bit addressing and unified address space.
• Error correcting codes.
• Faster context switching.
• Faster atomic instructions.
Fermi Multithreaded SIMD Proc.
[Figure: block diagram of a Fermi multithreaded SIMD processor.]
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence:
  – Focuses on determining whether data accesses in later iterations depend on data values produced in earlier iterations.
• Loop-level parallelism requires that there be no loop-carried dependence.
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• There is no loop-carried dependence.
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 uses a value computed by S1 in an earlier iteration (A[i]), and S2 uses a value computed by S2 in an earlier iteration (B[i]):
  – Loop-carried dependences.
• S2 also uses the value A[i+1] computed by S1 in the same iteration:
  – Not a loop-carried dependence.
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Memory Addressingbull Loadstore operations move groups of data between
registers and memorybull Three types of addressing
ndash Unit stridebull Contiguous block of information in memorybull Fastest always possible to optimize this
ndash Non‐unit (constant) stridebull Harder to optimize memory system for all possible stridesbull Prime number of data banks makes it easier to support different strides at full bandwidth
ndash Indexed (gather‐scatter)bull Vector equivalent of register indirectbull Good for sparse arrays of databull Increases number of programs that vectorize
CA-Lec8 cwliutwinseenctuedutw 38
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Interleaved Memory Layout
bull Great for unit stride ndash Contiguous elements in different DRAMsndash Startup time for vector operation is latency of single read
bull What about non‐unit stridendash Above good for strides that are relatively prime to 8ndash Bad for 2 4
Vector Processor
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
UnpipelinedD
RAM
AddrMod 8= 0
AddrMod 8= 1
AddrMod 8= 2
AddrMod 8= 4
AddrMod 8= 5
AddrMod 8= 3
AddrMod 8= 6
AddrMod 8= 7
CA-Lec8 cwliutwinseenctuedutw 39
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(Unit/Non-Unit) Stride Addressing
• The distance separating elements to be gathered into a single register is called the stride
• In the example:
  – Matrix D has a stride of 100 double words
  – Matrix B has a stride of 1 double word
• Use non-unit stride for D
  – To access non-sequential memory locations and reshape them into a dense structure
• The size of the matrix may not be known at compile time
  – Use LVWS/SVWS: load/store vector with stride
  – The vector stride, like the vector starting address, can be put in a general-purpose register (dynamic)
• Caches inherently deal with unit-stride data
  – Blocking techniques help non-unit-stride data
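To make the strides concrete, here is a hypothetical C fragment for the inner-product step of the loop nest above, annotated with the stride of each access:

/* Inner-product step of the earlier loop nest, assuming C-style
   row-major 100x100 arrays of doubles. */
double dot_row_col(double B[100][100], double D[100][100], int i, int j) {
    double acc = 0.0;
    for (int k = 0; k < 100; k++) {
        /* B[i][k]: consecutive k touches consecutive doubles -> stride 1    */
        /* D[k][j]: consecutive k jumps a whole 100-double row -> stride 100 */
        acc += B[i][k] * D[k][j];
    }
    return acc;
}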
How to get full bandwidth for unit stride?
• Memory system must sustain (# lanes x word) / clock
• # memory banks > memory latency, to avoid stalls
• If the desired throughput is greater than one word per cycle:
  – Either more banks (start multiple requests simultaneously)
  – Or wider DRAMs; only good for unit stride or large data types
• More banks are also good for supporting more strides at full bandwidth
  – (there is a paper on how to do a prime number of banks efficiently)
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
  for (i = 0; i < 256; i = i+1)
    x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, the column walk conflicts on word accesses
• SW: loop interchange, or declare the array size not a power of 2 ("array padding")
• HW: prime number of banks
  – bank number = address mod number of banks
  – address within bank = address / number of words in bank
  – modulo & divide per memory access with a prime number of banks?
  – address within bank = address mod number of words in bank
  – bank number: easy if 2^N words per bank
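A minimal sketch of the software fix ("array padding") for the loop above, assuming we can afford one wasted column:

/* Original layout: 512-int rows. With 128 banks, x[i][j] and
   x[i+1][j] map to the same bank (512 mod 128 == 0), so the
   column walk hits one bank repeatedly. */
int x[256][512];

/* Padded layout: 513-int rows. Successive rows now shift by one
   bank, so the j-th elements of consecutive rows land in
   different banks. The extra column is wasted space. */
int x_padded[256][513];

void scale_column(int j) {
    for (int i = 0; i < 256; i = i + 1)
        x_padded[i][j] = 2 * x_padded[i][j];  /* conflict-free column walk */
}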
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict!
• Stall the later request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – LCM(stride, #banks) / stride < bank busy time (equivalently, #banks / GCD(stride, #banks) < bank busy time)
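The condition can be checked mechanically. Below is a small C sketch (helper names are mine) that applies it to the 8-bank, 6-cycle configuration used in the next example:

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* With B banks, a stride-s access stream returns to the same bank
   every B / GCD(s, B) element accesses; if that count is below the
   bank busy time, every revisit stalls. */
int stride_stalls(unsigned s, unsigned B, unsigned busy) {
    return B / gcd(s, B) < busy;
}

int main(void) {
    printf("stride  1: %s\n", stride_stalls(1, 8, 6) ? "stalls" : "ok");
    printf("stride  3: %s\n", stride_stalls(3, 8, 6) ? "stalls" : "ok");
    printf("stride 32: %s\n", stride_stalls(32, 8, 6) ? "stalls" : "ok");
    return 0;
}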
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles to initiate
• What is the difference between a 64-element vector load with a stride of 1 versus 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
  – Every access to memory (after the first one) will collide with the previous access and will have to wait out the 6-cycle bank busy time
  – The total time will be 12 + 1 + 6 * 63 = 391 cycles, i.e., 6.1 cycles per element
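The same arithmetic, wrapped in a throwaway C helper (my own packaging of the slide's timing model) so other strides can be checked:

#include <stdio.h>

static unsigned gcd(unsigned a, unsigned b) {
    while (b) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Slide's model: 12-cycle startup, then one element per cycle if the
   stride cycles through enough banks; otherwise a 6-cycle wait for
   every element after the first. */
unsigned load_cycles(unsigned n, unsigned stride) {
    const unsigned banks = 8, busy = 6, latency = 12;
    if (banks / gcd(stride, banks) >= busy)
        return latency + n;               /* stride 1:  12 + 64 = 76    */
    return latency + 1 + busy * (n - 1);  /* stride 32: 12 + 1 + 6*63 = 391 */
}

int main(void) {
    printf("stride  1: %u cycles\n", load_cycles(64, 1));
    printf("stride 32: %u cycles\n", load_cycles(64, 32));
    return 0;
}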
Handling Sparse Matrices in Vector Architectures
• Supporting sparse matrices in vector mode is a necessity
• Example: consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed (gather/scatter)
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• Use index vectors:
LV     Vk, Rk        ; load K
LVI    Va, (Ra+Vk)   ; load A[K[]]
LV     Vm, Rm        ; load M
LVI    Vc, (Rc+Vm)   ; load C[M[]]
ADDVV.D Va, Va, Vc   ; add them
SVI    (Ra+Vk), Va   ; store A[K[]]
• A and C must have the same number of non-zero elements (the sizes of K and M)
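In plain C, the gather/scatter semantics of the sequence above look roughly like this (a sketch of the semantics only, assuming the indices within each 64-element chunk are distinct, which the compiler must also know for the vector version to be safe):

/* Semantics of the LVI/SVI sequence for the sparse sum, one
   64-element vector register's worth at a time. */
void sparse_sum(double *A, const double *C,
                const int *K, const int *M, int n) {
    double Va[64], Vc[64];  /* stand-ins for vector registers */
    for (int base = 0; base < n; base += 64) {
        int len = (n - base < 64) ? n - base : 64;
        for (int i = 0; i < len; i++) Va[i] = A[K[base + i]]; /* gather  */
        for (int i = 0; i < len; i++) Vc[i] = C[M[base + i]]; /* gather  */
        for (int i = 0; i < len; i++) Va[i] = Va[i] + Vc[i];  /* ADDVV.D */
        for (int i = 0; i < len; i++) A[K[base + i]] = Va[i]; /* scatter */
    }
}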
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, energy efficient, and a better real-time model than out-of-order
• More lanes allow a slower clock rate
  – Scalable if elements are independent
  – If there is a dependency: one stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (AltiVec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Motorola 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuses the 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store of 8 8-bit operands
• Claim: overall speedup of 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm., ...
  – used in drivers or added to library routines; no compiler support
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – optionally signed/unsigned saturate (set to max) on overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (sets to max) if the number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architectures
Example SIMD Code
• Example: DXPY
      L.D    F0, a         ; load scalar a
      MOV    F1, F0        ; copy a into F1 for SIMD MUL
      MOV    F2, F0        ; copy a into F2 for SIMD MUL
      MOV    F3, F0        ; copy a into F3 for SIMD MUL
      DADDIU R4, Rx, #512  ; last address to load
Loop: L.4D   F4, 0[Rx]     ; load X[i], X[i+1], X[i+2], X[i+3]
      MUL.4D F4, F4, F0    ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
      L.4D   F8, 0[Ry]     ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
      ADD.4D F8, F8, F4    ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
      S.4D   0[Ry], F8     ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
      DADDIU Rx, Rx, #32   ; increment index to X
      DADDIU Ry, Ry, #32   ; increment index to Y
      DSUBU  R20, R4, Rx   ; compute bound
      BNEZ   R20, Loop     ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only accesses contiguous data
• No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity:
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)
[Figure: roofline plots showing the memory-bound and compute-bound regions for sample machines]
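A back-of-the-envelope C sketch of that min(); the peak numbers below are made-up machine parameters, purely for illustration:

#include <stdio.h>

/* Roofline: attainable GFLOP/s is capped either by the memory roof
   (bandwidth x intensity) or by the compute roof (peak FLOP/s). */
double attainable_gflops(double peak_gflops, double peak_bw_gbs,
                         double intensity_flops_per_byte) {
    double mem_bound = peak_bw_gbs * intensity_flops_per_byte;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    /* hypothetical machine: 100 GFLOP/s peak, 25 GB/s memory BW */
    for (double ai = 0.25; ai <= 16.0; ai *= 2)
        printf("AI %5.2f -> %6.2f GFLOP/s\n", ai,
               attainable_gflops(100.0, 25.0, ai));
    return 0;
}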
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally on high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
  – Single Instruction Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – One SIMD instruction executes 32 elements at a time
  – Thus, grid size = 16 blocks
  – A block is analogous to a strip-mined vector loop with a vector length of 32
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
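The decomposition arithmetic, spelled out as a small C calculation (pure bookkeeping, no GPU code):

#include <stdio.h>

int main(void) {
    const int elements          = 8192;
    const int threads_per_block = 512;  /* one CUDA thread per element   */
    const int simd_width        = 32;   /* one SIMD instr covers 32 elems */

    int blocks         = elements / threads_per_block;     /* grid = 16 */
    int simd_per_block = threads_per_block / simd_width;   /* 16 threads
                                                              of SIMD
                                                              instructions */
    printf("grid: %d blocks, %d threads of SIMD instructions per block\n",
           blocks, simd_per_block);
    return 0;
}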
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads!
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
  – Divided into lanes
  – Each SIMD thread is limited to 64 registers
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements
    • 32 vector registers of 32 64-bit elements
  – Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU
[Figure: GTX570 memory hierarchy. Global memory: 1280MB; L2 cache: 640KB. Each of the 15 SMs (SM 0 ... SM 14, 32 cores each) has 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, and 8KB constant cache; up to 1536 threads per SM.]
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
  – Memory access optimization, latency hiding
Matrix Multiplication
Programming the GPU
Matrix Multiplication
• For a 4096x4096 matrix multiplication, matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
Programming the GPU
• Distinguishing the execution place of functions:
  – __device__ or __global__ => GPU device; variables declared there are allocated to GPU memory
  – __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier
  – threadIdx: thread-within-block identifier
  – blockDim: threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (a kernel launched with <<<...>>> must be __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example (DAXPY):
shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
  – Branch synchronization stack
    • Entries consist of masks for each SIMD lane
    • i.e., which threads commit their results (all threads execute)
  – Instruction markers to manage when a branch diverges into multiple execution paths
    • Push on divergent branch
  – ...and when paths converge
    • Act as barriers
    • Pops stack
• Per-thread-lane 1-bit predicate register, specified by the programmer
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

      ld.global.f64 RD0, [X+R8]   ; RD0 = X[i]
      setp.neq.s32  P1, RD0, #0   ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
      ld.global.f64 RD2, [Y+R8]   ; RD2 = Y[i]
      sub.f64       RD0, RD0, RD2 ; Difference in RD0
      st.global.f64 [X+R8], RD0   ; X[i] = RD0
@P1,  bra ENDIF1, *Comp           ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 uses A[i] computed by S1 in the previous iteration, and S2 uses B[i] computed by S2 in the previous iteration
  – Loop-carried dependences
• S2 also uses A[i+1], computed by S1 in the same iteration
  – Not loop-carried
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];    /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding dependencies: Example
• Assume:
  – We store to an array element with index a*i + b and load from it with index c*i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m <= j <= n and m <= k <= n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
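A small C sketch of the test (helper names are mine), applied to the example that follows, where the store index is 2*i+3 and the load index is 2*i:

#include <stdio.h>

static int gcd(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for X[a*i+b] (store) vs. X[c*i+d] (load): a loop-carried
   dependence requires GCD(c, a) to divide (d - b). Returns 1 when a
   dependence is possible, 0 when it is ruled out. */
int gcd_test(int a, int b, int c, int d) {
    return (d - b) % gcd(c, a) == 0;
}

int main(void) {
    /* The next slide's loop: X[2*i+3] = X[2*i] * 5.0 */
    printf("dependence possible? %s\n",
           gcd_test(2, 3, 2, 0) ? "yes" : "no");  /* prints "no" */
    return 0;
}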
Example
for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2; |d - b| = 3
  3. Since 2 does not divide 3, no dependence is possible
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Handling Multidimensional Arrays in Vector Architectures
bull Considerfor (i = 0 i lt 100 i=i+1)
for (j = 0 j lt 100 j=j+1) A[i][j] = 00for (k = 0 k lt 100 k=k+1)A[i][j] = A[i][j] + B[i][k] D[k][j]
bull Must vectorize multiplications of rows of B with columns of D
ndash Need to access adjacent elements of B and adjacent elements of Dndash Elements of B accessed in row‐major order but elements of D in
column‐major orderbull Once vector is loaded into the register it acts as if it has logically
adjacent elements
CA-Lec8 cwliutwinseenctuedutw 40
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
(UnitNon‐Unit) Stride Addressingbull The distance separating elements to be gathered into a single
register is called stridebull In the example
ndash Matrix D has a stride of 100 double wordsndash Matrix B has a stride of 1 double word
bull Use non‐unit stride for Dndash To access non‐sequential memory location and to reshape them into a
dense structurebull The size of the matrix may not be known at compile time
ndash Use LVWSSVWS loadstore vector with stridebull The vector stride like the vector starting address can be put in a general‐
purpose register (dynamic)bull Cache inherently deals with unit stride data
ndash Blocking techniques helps non‐unit stride data
CA-Lec8 cwliutwinseenctuedutw 41
How to get full bandwidth for unit stride
bull Memory system must sustain ( lanes x word) clockbull memory banks gt memory latency to avoid stallsbull If desired throughput greater than one word per cycle
ndash Either more banks (start multiple requests simultaneously)ndash Or wider DRAMS Only good for unit stride or large data types
bull memory banks gt memory latency to avoid stallsbull More numbers of banks good to support more strides at full
bandwidthndash can read paper on how to do prime number of banks efficiently
CA-Lec8 cwliutwinseenctuedutw 42
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations: IBM
• VMX (1996-1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrate FPU and VMX into one unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfers aligned in memory
– Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DXPY
  LD      F0, a          ; load scalar a
  MOV     F1, F0         ; copy a into F1 for SIMD MUL
  MOV     F2, F0         ; copy a into F2 for SIMD MUL
  MOV     F3, F0         ; copy a into F3 for SIMD MUL
  DADDIU  R4, Rx, #512   ; last address to load
Loop:
  L.4D    F4, 0[Rx]      ; load X[i], X[i+1], X[i+2], X[i+3]
  MUL.4D  F4, F4, F0     ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
  L.4D    F8, 0[Ry]      ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
  ADD.4D  F8, F8, F4     ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
  S.4D    0[Ry], F8      ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
  DADDIU  Rx, Rx, #32    ; increment index to X
  DADDIU  Ry, Ry, #32    ; increment index to Y
  DSUBU   R20, R4, Rx    ; compute bound
  BNEZ    R20, Loop      ; check if done
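The same computation in C with SSE2 intrinsics, as a sketch alongside the MIPS-style notation above: two doubles per 128-bit register, with the scalar a broadcast once (playing the role of the MOVs above), plus a scalar cleanup loop for odd n:

#include <emmintrin.h>

/* DAXPY with 128-bit SSE2: y[i] = a*x[i] + y[i], two doubles at a time */
void daxpy_sse2(int n, double a, const double *x, double *y)
{
    __m128d va = _mm_set1_pd(a);           /* broadcast scalar a */
    for (int i = 0; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(&x[i]);  /* load X[i], X[i+1] */
        __m128d vy = _mm_loadu_pd(&y[i]);  /* load Y[i], Y[i+1] */
        vy = _mm_add_pd(_mm_mul_pd(va, vx), vy);
        _mm_storeu_pd(&y[i], vy);          /* store Y[i], Y[i+1] */
    }
    for (int i = n & ~1; i < n; i++)       /* scalar cleanup for odd n */
        y[i] = a*x[i] + y[i];
}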
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128-256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea
– Plot peak floating-point throughput as a function of arithmetic intensity
– Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
– Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(peak memory BW × arithmetic intensity, peak floating-point performance)
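The roofline is just the minimum of the two ceilings; a minimal C sketch of that formula (the peak numbers in the usage example are placeholders, not measurements of any particular machine):

#include <stdio.h>

/* Attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * arithmetic intensity) */
static double roofline(double peak_gflops, double peak_gbs, double ai)
{
    double mem_bound = peak_gbs * ai;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void)
{
    /* e.g., DAXPY: 2 FLOPs per element for 24 bytes moved (two 8-byte loads,
       one 8-byte store), so AI = 2/24 FLOP/byte */
    printf("%.2f GFLOP/s\n", roofline(100.0, 50.0, 2.0 / 24.0));
    return 0;
}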
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C-like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is "Single Instruction Multiple Thread"
Programming Model
• CUDA's design goals
– extend a standard sequential programming language, specifically C/C++
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines
– Works well with data-level parallel problems
– Scatter-gather transfers from memory into local store
– Mask registers
– Large register files
• Differences
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
– Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip-mined vector loop with vector length of 32
– Block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
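The grid and block arithmetic above can be written down directly; a small C sketch of the bookkeeping, using the slide's numbers (8192 elements, 512 threads per block, 32-wide SIMD):

#include <stdio.h>

int main(void)
{
    int n = 8192, threads_per_block = 512, simd_width = 32;
    int grid_size = (n + threads_per_block - 1) / threads_per_block; /* 16 blocks */
    int simd_threads = threads_per_block / simd_width;  /* 16 threads of SIMD
                                                           instructions per block */
    printf("grid = %d blocks, %d SIMD threads/block\n", grid_size, simd_threads);
    return 0;
}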
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32-bit elements
• 32 vector registers of 32 64-bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
GTX570 GPU (memory hierarchy)
• Global memory: 1280MB; shared L2 cache: 640KB
• 15 SMs (SM 0 to SM 14), each with: 32 cores, 32,768 registers, 48KB shared memory, 16KB L1 cache, 8KB texture cache, 8KB constant cache
• Up to 1536 threads per SM
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
Matrix Multiplication
Programming the GPU
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth.
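To make the one-thread-per-cell mapping concrete, here is a plain-C sketch of the work a single thread would do (a hypothetical helper; on the GPU, row and col would come from blockIdx/threadIdx rather than arguments):

/* Work of one GPU thread: compute one cell C[row][col] of C = A * B.
   N is the matrix dimension (4096 in the slide's example). */
void mm_one_cell(const float *A, const float *B, float *C,
                 int N, int row, int col)
{
    float acc = 0.0f;
    for (int k = 0; k < N; k++)
        acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}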
Programming the GPU
• Distinguishing execution place of functions:
– __device__ or __global__ => GPU (device); variables declared are allocated to the GPU memory
– __host__ => system processor (host)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA
__global__
void daxpy(int n, double a, double *x, double *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– ...and when paths converge
• Act as barriers
• Pops stack
• Per-thread-lane 1-bit predicate register, specified by programmer
Example
if (X[i] != 0)
    X[i] = X[i] - Y[i];
else
    X[i] = Z[i];

ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0        ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2      ; Difference in RD0
st.global.f64 [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
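The same divergence, expressed with an explicit mask in scalar C; this is a sketch of the semantics only (each "lane" computes both paths' results under its predicate, just as all GPU threads execute while only the masked ones commit):

/* Mask-based view of the IF/ELSE above, over a vector of lanes */
void masked_if(double *X, const double *Y, const double *Z, int lanes)
{
    for (int i = 0; i < lanes; i++) {
        int p = (X[i] != 0.0);        /* per-lane 1-bit predicate */
        double then_val = X[i] - Y[i]; /* THEN path result */
        double else_val = Z[i];        /* ELSE path result */
        X[i] = p ? then_val : else_val; /* only the masked result commits */
    }
}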
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
– "Private memory"
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc.
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by the same statement in the previous iteration
– Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
– Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• To determine which loop might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi-dimensional array is affine if the index in each dimension is affine
– Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding dependencies: Example
• Assume:
– Load an array element with index c × i + d, and store to index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There are j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop-carried dependence exists, then GCD(c,a) must divide |d - b|
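A minimal C implementation of the test (function names are mine):

#include <stdlib.h>

static int gcd(int a, int b)    /* Euclid's algorithm */
{
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test: store index a*i+b, load index c*i+d (a, c assumed nonzero).
   Returns 0 only when a loop-carried dependence is impossible. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    return (abs(d - b) % gcd(abs(c), abs(a))) == 0;
}

/* The next slide's example (a=2, b=3, c=2, d=0): 3 % 2 != 0,
   so gcd_test_may_depend(2,3,2,0) returns 0: no dependence. */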
Example
for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a,c) = 2, |b-d| = 3
3. Since 2 does not divide 3, no dependence is possible
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists; e.g., for (i=0; i<10; i=i+1) X[i] = X[i+100] + 1; passes the test (GCD(1,1) divides 100), yet store indices 0..9 and load indices 100..109 never overlap
• Determining whether a dependence actually exists is NP-complete
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 (Y[i]), S1->S4 (Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];         /* scalar expansion: scalar ==> vector; parallel */
for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i]; /* reduction: sums the vector's elements; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: To sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
– Assume p ranges from 0 to 9
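Put together, the scalar expansion and two-level reduction of the last two slides look like this in C (a sketch with fixed sizes; the p-loop is what would be spread across the ten processors):

#define N     10000
#define NPROC 10

double dot_reduce(const double *x, const double *y)
{
    static double sum[N], partial[NPROC];
    double finalsum = 0.0;

    for (int i = 0; i < N; i++)        /* scalar expansion: fully parallel */
        sum[i] = x[i] * y[i];

    for (int p = 0; p < NPROC; p++) {  /* one partial sum per "processor" */
        partial[p] = 0.0;
        for (int i = 0; i < N / NPROC; i++)
            partial[p] += sum[i + (N / NPROC) * p];
    }
    for (int p = 0; p < NPROC; p++)    /* short serial combine at the end */
        finalsum += partial[p];
    return finalsum;
}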
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
– Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
How to get full bandwidth for unit stride
• Memory system must sustain (# lanes × word) / clock
• # memory banks > memory latency to avoid stalls
• If desired throughput is greater than one word per cycle:
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMs; only good for unit stride or large data types
• More banks are also good to support more strides at full bandwidth
– can read paper on how to do a prime number of banks efficiently
Memory Bank Conflicts
• Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses
• SW: loop interchange, or declaring the array not a power of 2 ("array padding")
• HW: prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with a prime number of banks
– address within bank = address mod number of words in bank
– bank number easy if 2^N words per bank

int x[256][512];
for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
        x[i][j] = 2 * x[i][j];  /* column-order sweep: stride of 512 words */
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
– Memory bank conflict
• Stall the other request to resolve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
– LCM(stride, # banks) / stride < bank busy time (equivalently, # banks / GCD(stride, # banks) < bank busy time)
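The condition can be checked mechanically; a small C sketch (helper names are mine):

static unsigned gcd_u(unsigned a, unsigned b)   /* Euclid's algorithm */
{
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* A stride conflicts when the same bank is revisited before it recovers:
   accesses between hits on one bank = #banks / GCD(stride, #banks). */
int bank_conflict(unsigned stride, unsigned banks, unsigned busy_time)
{
    unsigned revisit = banks / gcd_u(stride, banks);
    return revisit < busy_time;
}

/* Matches the next slide: bank_conflict(1, 8, 6) == 0 (8 >= 6, no stall),
   bank_conflict(32, 8, 6) == 1 (revisit every access, stall). */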
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles to initialize
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element
2. Since 32 is a multiple of 8, this is the worst possible stride
– every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
– the load will take 12 + 1 + 6 × 63 = 391 cycles, i.e., 6.1 cycles per element
Handling Sparse Matrices in Vector Architecture
• Handling sparse matrices in vector mode is a necessity
• Example: Consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
    A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly
• Gather-scatter operations are used
– Using LVI/SVI: load/store vector indexed (gather/scatter)
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Memory Bank Conflicts
bull Even with 128 banks since 512 is multiple of 128 conflict on word accessesbull SW loop interchange or declaring array not power of 2 (ldquoarray paddingrdquo)bull HW Prime number of banks
ndash bank number = address mod number of banksndash address within bank = address number of words in bankndash modulo amp divide per memory access with prime no banksndash address within bank = address mod number words in bankndash bank number easy if 2N words per bank
CA-Lec8 cwliutwinseenctuedutw 43
int x[256][512]for (j = 0 j lt 512 j = j+1)
for (i = 0 i lt 256 i = i+1)x[i][j] = 2 x[i][j]
Problems of Stride Addressing
bull Once we introduce non‐unit strides it becomes possible to request accesses from the same bank frequently
ndash Memory bank conflict
bull Stall the other request to solve bank conflict
bull Bank conflict (stall) occurs when the same bank is hit faster than bank busy time
ndash banks LCM(stridebanks) lt bank busy time
CA-Lec8 cwliutwinseenctuedutw 44
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64 RD0, [X+R8]         ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push            ; Push old mask, set new mask bits
                                  ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]         ; RD2 = Y[i]
sub.f64       RD0, RD0, RD2       ; Difference in RD0
st.global.f64 [X+R8], RD0         ; X[i] = RD0
@P1, bra ENDIF1, *Comp            ; complement mask bits
                                  ; if P1 true, go to ENDIF1
ELSE1:  ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
        st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop  ; pop to restore old mask
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
  – "Private memory"
  – Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
  – Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
  – Host can read and write GPU memory (the CUDA sketch below maps these three levels to source code)
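In CUDA source these three levels show up as per-thread local variables, per-block __shared__ arrays, and global memory. A minimal sketch, assuming 256 threads per block (names are illustrative):

// global_buf lives in GPU Memory; the host reads/writes it with cudaMemcpy
__global__ void levels(float *global_buf) {
  float priv;                                 // per-thread: registers / private memory
  __shared__ float block_buf[256];            // per-block: shared by the block's SIMD lanes
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  block_buf[threadIdx.x] = global_buf[i];
  __syncthreads();                            // whole block has staged its data
  priv = block_buf[(threadIdx.x + 1) % 256];  // lanes exchange values via shared memory
  global_buf[i] = priv;
}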
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc. (block diagram omitted)
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Analysis focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism: no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 each use a value computed by the same statement in the previous iteration (S1 reads A[i], S2 reads B[i])
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
for (i=0; i<100; i=i+1) {
  A[i] = A[i] + B[i];      /* S1 */
  B[i+1] = C[i] + D[i];    /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.
Example 3 for Loop-Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried.
More on Loop-Level Parallelism
for (i=0; i<100; i=i+1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
  – The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis (loop-carried dependence analysis plus data dependence analysis) can be applied within the same basic block to optimize
Recurrence
• A recurrence is a special form of loop-carried dependence.
• Recurrence example:
for (i=1; i<100; i=i+1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism ("inexact") and eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.
Finding Dependences: Example
• Assume:
  – A store to an array element indexed by a*i + b, and a load from the same array indexed by c*i + d
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k within the loop bounds: m <= j <= n, m <= k <= n
  2. The store with index a*j + b and the load with index c*k + d touch the same element: a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
(a small code sketch of the test follows)
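A small C sketch of the GCD test as stated above (function names are illustrative; the test is conservative, since it ignores the loop bounds):

#include <stdio.h>

static int gcd(int x, int y) {                 // Euclid's algorithm
  while (y != 0) { int t = x % y; x = y; y = t; }
  return x < 0 ? -x : x;
}

// Returns 1 if a dependence between store a*i+b and load c*i+d is possible.
static int gcd_test(int a, int b, int c, int d) {
  int g = gcd(c, a);
  if (g == 0) return b == d;                   // both indices are constants
  return (d - b) % g == 0;                     // GCD(c,a) must divide (d-b)
}

int main(void) {
  // X[2*i+3] = X[2*i] * 5.0: a=2, b=3, c=2, d=0 -> 2 does not divide -3
  printf("%s\n", gcd_test(2, 3, 2, 0) ? "possible" : "impossible");
  return 0;
}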
Example
for (i=0; i<100; i=i+1)
  X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |d-b| = 3
  3. Since 2 does not divide 3, no dependence is possible.
Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (reports a possible dependence) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences: Example 2
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;   /* S1 */
  X[i] = X[i] + c;   /* S2 */
  Z[i] = Y[i] + c;   /* S3 */
  Y[i] = c - Y[i];   /* S4 */
}
• True dependences: S1->S3 and S1->S4 (because of Y[i]); not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;    /* Y renamed to T: removes output dependence */
  X1[i] = X[i] + c;   /* X renamed to X1: removes antidependence */
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
• The first loop is scalar expansion (scalar ==> vector); it is parallel.
• The second loop is a reduction (it sums up the elements of the vector); it is not parallel.
Reduction
• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (a complete sketch follows):
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
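A complete C sketch of this two-level reduction; the outer loop over p stands in for the ten processors, which would run in parallel on a real machine (array sizes and names are illustrative):

#include <stdio.h>

int main(void) {
  double sum[10000], finalsum[10] = {0.0}, total = 0.0;
  for (int i = 0; i < 10000; i++) sum[i] = 1.0;   // stand-in partial products

  // Level 1: each "processor" p reduces its own 1000-element slice
  // (the ten slices are independent, so this level is parallel across p)
  for (int p = 0; p < 10; p++)
    for (int i = 999; i >= 0; i = i - 1)
      finalsum[p] = finalsum[p] + sum[i + 1000 * p];

  // Level 2: a short serial reduction combines the ten partial sums
  for (int p = 0; p < 10; p++)
    total = total + finalsum[p];

  printf("%f\n", total);                          // prints 10000.000000
  return 0;
}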
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Problems of Stride Addressing
• Once we introduce non-unit strides, it becomes possible to request accesses from the same bank frequently
  – Memory bank conflict
• Stall the other request to solve a bank conflict
• A bank conflict (stall) occurs when the same bank is hit faster than the bank busy time:
  – #banks / LCM(stride, #banks) < bank busy time
Example
• 8 memory banks; bank busy time: 6 cycles; total memory latency: 12 cycles for initiation.
• What is the difference between a 64-element vector load with a stride of 1 and with a stride of 32?
• Solution:
  1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e., 1.2 cycles per element.
  2. Since 32 is a multiple of 8, it is the worst possible stride:
     – every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-cycle bank busy time
     – the total time will be 12 + 1 + 6*63 = 391 cycles, i.e., 6.1 cycles per element
(a small simulation of this model follows)
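A small C sketch that reproduces this arithmetic for any stride, under the slide's timing model (first element after the full latency, then one element per clock unless the target bank is still busy; constants and names are illustrative):

#include <stdio.h>

#define BANKS   8
#define BUSY    6    /* bank busy time, cycles */
#define LATENCY 12   /* initial memory latency, cycles */

static long load_cycles(int n, int stride) {
  long ready[BANKS] = {0};   /* cycle at which each bank is free again */
  long t = LATENCY;          /* first element arrives after the full latency */
  for (int i = 0; i < n; i++) {
    int bank = (int)(((long)i * stride) % BANKS);
    if (t < ready[bank]) t = ready[bank];   /* stall on a bank conflict */
    ready[bank] = t + BUSY;
    t = t + 1;               /* otherwise, one element per clock */
  }
  return t;
}

int main(void) {
  printf("stride 1:  %ld cycles\n", load_cycles(64, 1));   /* 76  */
  printf("stride 32: %ld cycles\n", load_cycles(64, 32));  /* 391 */
  return 0;
}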
Handling Sparse Matrices in Vector Architecture
• Handling sparse matrices in vector mode is a necessity.
• Example: consider a sparse vector sum on arrays A and C:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
  – where K and M are index vectors that designate the nonzero elements of A and C
• Sparse matrix elements are stored in a compact form and accessed indirectly.
• Gather-scatter operations are used
  – Using LVI/SVI: load/store vector indexed, i.e., gather/scatter
Scatter-Gather
• Consider:
for (i = 0; i < n; i=i+1)
  A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors.
• Use index vectors:
LV      Vk, Rk        ; load K
LVI     Va, (Ra+Vk)   ; load A[K[]]  (gather)
LV      Vm, Rm        ; load M
LVI     Vc, (Rc+Vm)   ; load C[M[]]  (gather)
ADDVV.D Va, Va, Vc    ; add them
SVI     (Ra+Vk), Va   ; store A[K[]] (scatter)
• A and C must have the same number of non-zero elements (sizes of K and M); a plain-C rendering follows.
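In plain C, the gather and scatter steps correspond to explicit index-vector copies; a sketch with temporaries standing in for the vector registers Va and Vc (assumes n is at most the vector length of 64):

void sparse_add(double *A, const double *C, const int *K, const int *M, int n) {
  double Va[64], Vc[64];                                /* "vector registers" */
  for (int j = 0; j < n; j++) Va[j] = A[K[j]];          /* LVI: gather A[K[]]   */
  for (int j = 0; j < n; j++) Vc[j] = C[M[j]];          /* LVI: gather C[M[]]   */
  for (int j = 0; j < n; j++) Va[j] = Va[j] + Vc[j];    /* ADDVV.D              */
  for (int j = 0; j < n; j++) A[K[j]] = Va[j];          /* SVI: scatter to A[K[]] */
}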
Vector Architecture Summary
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than out-of-order machines
• More lanes, slower clock rate
  – Scalable if elements are independent
  – If there is a dependency:
    • One stall per vector instruction rather than one stall per vector element
• Programmer is in charge of giving hints to the compiler
• Design issues: number of lanes, functional units and registers, length of vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
  – Especially with virtual address translation and caching
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to the compiler
• Cray Y-MP benchmarks (table omitted)
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64-bit registers split into 2x32b, 4x16b, or 8x8b
• Newer designs have 128-bit registers (Altivec, SSE2)
• Limited instruction set:
  – no vector length control
  – no load/store stride or scatter/gather
  – unit-stride loads must be aligned to a 64/128-bit boundary
• Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
"Vector" for Multimedia
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
  – similar to Intel 860, Mot. 88110, HP PA-71000LC, UltraSPARC
• 3 data types: 8 8-bit, 4 16-bit, 2 32-bit, in 64 bits
  – reuse 8 FP registers (FP and MMX cannot mix)
• Short vector: load, add, store 8 8-bit operands
• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio, video, speech, comm.
  – use in drivers or added to library routines; no compiler
MMX Instructions
• Move: 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
  – opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll, srl, sra), And, And Not, Or, Xor in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply-Add in parallel: 4 16b
• Compare =, > in parallel: 8 8b, 4 16b, 2 32b
  – sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
  – Convert 32b <-> 16b, 16b <-> 8b
  – Pack saturates (set to max) if number is too large
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
  – Repurposed the 64-bit floating-point registers
  – Eight 8-bit integer ops or four 16-bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
  – Separate 128-bit registers
  – Eight 16-bit integer ops; four 32-bit integer/fp ops or two 64-bit integer/fp ops
  – Single-precision floating-point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
  – Double-precision floating-point arithmetic
• Advanced Vector Extensions (2010)
  – 256-bit registers
  – Four 64-bit integer/fp ops
  – Extensible to 512 and 1024 bits for future generations
(an SSE sketch follows)
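As a concrete illustration of these fixed-width registers, a minimal SSE sketch of single-precision a*X+Y using compiler intrinsics (a sketch, assuming n is a multiple of 4; the unaligned loads/stores avoid any alignment requirement):

#include <xmmintrin.h>   /* SSE intrinsics (1999) */

/* y[i] = a*x[i] + y[i], four 32-bit floats per instruction */
void saxpy_sse(int n, float a, const float *x, float *y) {
  __m128 va = _mm_set1_ps(a);              /* broadcast a into all 4 lanes */
  for (int i = 0; i < n; i += 4) {
    __m128 vx = _mm_loadu_ps(x + i);       /* load x[i..i+3] */
    __m128 vy = _mm_loadu_ps(y + i);       /* load y[i..i+3] */
    vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
    _mm_storeu_ps(y + i, vy);              /* store y[i..i+3] */
  }
}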
SIMD Implementations: IBM
• VMX (1996-1998)
  – 32 4b, 16 8b, 8 16b, 4 32b integer ops, and 4 32b FP ops
  – Data rearrangement
• Cell SPE (PS3)
  – 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
  – Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
  – Extension to 128 registers
• VSX (2009)
  – 1 or 2 64b FPU, 4 32b FPU
  – Integrate FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
  – Four 64b SP or DP FP
Why SIMD Extensions?
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
  – Vector: a single instruction makes 64 memory accesses, and a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
Example SIMD Code
• Example: DAXPY
LD      F0, a        ; load scalar a
MOV     F1, F0       ; copy a into F1 for SIMD MUL
MOV     F2, F0       ; copy a into F2 for SIMD MUL
MOV     F3, F0       ; copy a into F3 for SIMD MUL
DADDIU  R4, Rx, #512 ; last address to load
Loop:
L.4D    F4, 0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D  F4, F4, F0   ; a*X[i], a*X[i+1], a*X[i+2], a*X[i+3]
L.4D    F8, 0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D  F8, F8, F4   ; a*X[i]+Y[i], ..., a*X[i+3]+Y[i+3]
S.4D    0[Ry], F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU  Rx, Rx, #32  ; increment index to X
DADDIU  Ry, Ry, #32  ; increment index to Y
DSUBU   R20, R4, Rx  ; compute bound
BNEZ    R20, Loop    ; check if done
Challenges of SIMD Architectures
• Scalar processor memory architecture
  – Only access to contiguous data
    • No efficient scatter/gather accesses
  – Significant penalty for unaligned memory access
  – May need to write the entire vector register
• Limitations on data access patterns
  – Limited by the cache line, up to 128-256b
• Conditional execution
  – Register renaming does not work well with masked execution
  – Always need to write the whole register
  – Difficult to know when to indicate exceptions
• Register pressure
  – Need to use multiple registers, rather than register depth, to hide latency
Roofline Performance Model
• Basic idea:
  – Plot peak floating-point throughput as a function of arithmetic intensity
  – Ties together floating-point performance and memory performance for a target machine
• Arithmetic intensity
  – Floating-point operations per byte read
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW x Arithmetic Intensity, Peak Floating-Point Performance)
(the slide's roofline plots for sample machines are not reproduced here)
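A worked instance of the formula (illustrative numbers, not from the slides): on a machine with 16 GB/sec peak memory bandwidth and 32 GFLOPs/sec peak floating-point performance, a kernel with arithmetic intensity 0.5 FLOPs/byte attains min(16 x 0.5, 32) = 8 GFLOPs/sec (memory-bound), while a kernel with 4 FLOPs/byte attains min(16 x 4, 32) = 32 GFLOPs/sec (compute-bound).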
History of GPUs
• Early video cards
  – Frame buffer memory with address generation for video output
• 3D graphics processing
  – Originally high-end computers (e.g., SGI)
  – 3D graphics cards for PCs and game consoles
• Graphics Processing Units
  – Processors oriented to 3D graphics tasks
  – Vertex/pixel processing, shading, texture mapping, ray tracing
Graphical Processing Units
• Basic idea:
  – Heterogeneous execution model
    • CPU is the host, GPU is the device
  – Develop a C-like programming language for the GPU
  – Unify all forms of GPU parallelism as the CUDA thread
  – Programming model is "Single Instruction, Multiple Thread"
Programming Model
• CUDA's design goals:
  – extend a standard sequential programming language, specifically C/C++
    • focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
  – a minimalist set of abstractions for expressing parallelism
    • highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
NVIDIA GPU Architecture
• Similarities to vector machines:
  – Works well with data-level parallel problems
  – Scatter-gather transfers from memory into local store
  – Mask registers
  – Large register files
• Differences:
  – No scalar processor or scalar integration
  – Uses multithreading to hide memory latency
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor
Programming the GPU
• CUDA programming model
  – Single Instruction, Multiple Thread (SIMT)
    • A thread is associated with each data element
    • Threads are organized into blocks
    • Blocks are organized into a grid
• GPU hardware handles thread management, not applications or OS
  – Given the hardware invested to do graphics well, how can we supplement it to improve performance of a wider range of applications?
Example
• Multiply two vectors of length 8192
  – Code that works over all elements is the grid
  – Thread blocks break this down into manageable sizes
    • 512 threads per block
  – SIMD instruction executes 32 elements at a time
  – Thus grid size = 16 blocks
  – Block is analogous to a strip-mined vector loop with vector length of 32
  – Block is assigned to a multithreaded SIMD processor by the thread block scheduler
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
(a launch sketch follows)
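A minimal CUDA sketch of this example (kernel and variable names are illustrative): 8192 elements / 512 threads per block = 16 blocks, and the hardware issues each block as 512/32 = 16 SIMD threads of 32 lanes each.

__global__ void vecmul(const double *a, const double *b, double *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
  c[i] = a[i] * b[i];                             // one element per CUDA thread
}
// Host side: vecmul<<<16, 512>>>(a, b, c);  // grid of 16 blocks, 512 threads each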
Terminology
• Threads of SIMD instructions
  – Each has its own PC
  – Thread scheduler uses a scoreboard to dispatch
  – No data dependencies between threads
  – Keeps track of up to 48 threads of SIMD instructions
    • Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
  – 32 SIMD lanes
  – Wide and shallow compared to vector processors
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Examplebull 8 memory banks Bank busy time 6 cycles Total memory
latency 12 cycles for initializedbull What is the difference between a 64‐element vector load
with a stride of 1 and 32
bull Solution1 Since 8 gt 6 for a stride of 1 the load will take 12+64=76
cycles ie 12 cycles per element2 Since 32 is a multiple of 8 the worst possible stride
ndash every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
ndash The total time will take 12+1+636=391 cycles ie 61 cycles per element
CA-Lec8 cwliutwinseenctuedutw 45
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Handling Sparse Matrices in Vector architecture
bull Sparse matrices in vector mode is a necessitybull Example Consider a sparse vector sum on arrays A and C
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
ndash where K and M and index vectors to designate the nonzero elements of A and C
bull Sparse matrix elements stored in a compact form and accessed indirectly
bull Gather‐scatter operations are usedndash Using LVISVI loadstore vector indexed or gathered
CA-Lec8 cwliutwinseenctuedutw 46
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
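As a concrete taste of these operations, a small intrinsics sketch (assuming a compiler that still ships <mmintrin.h>); _mm_adds_pu8 is the saturating unsigned add described above:

#include <mmintrin.h>

/* Sketch: packed adds on eight 8-bit lanes, wrapping vs. saturating. */
__m64 wrap_add(__m64 a, __m64 b)     { return _mm_add_pi8(a, b);  }
__m64 saturate_add(__m64 a, __m64 b) { return _mm_adds_pu8(a, b); }
/* _mm_empty() must be issued before returning to x87 FP code, since MMX
   state aliases the FP registers (the "FP and MMX cannot mix" rule). */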
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurposes 64‐bit floating‐point registers
– Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128‐bit registers
– Eight 16‐bit integer ops, four 32‐bit integer/fp ops, or two 64‐bit integer/fp ops
– Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (AVX) (2010)
– 256‐bit registers
– Four 64‐bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
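A hedged snippet showing the width progression in intrinsics form (SSE2's 128-bit and AVX's 256-bit packed double-precision adds):

#include <immintrin.h>

/* Sketch: the same packed double-precision add at two generations' widths. */
__m128d add2(__m128d a, __m128d b) { return _mm_add_pd(a, b);    } /* SSE2: 2 doubles */
__m256d add4(__m256d a, __m256d b) { return _mm256_add_pd(a, b); } /* AVX:  4 doubles */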
SIMD Implementations: IBM
• VMX (1996‐1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox 360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrates FPU and VMX into a unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
• Media applications operate on data types narrower than the native word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Needs smaller memory bandwidth than vector
• Separate data transfers aligned in memory
– Vector: a single instruction can touch 64 memory accesses, so a page fault in the middle of the vector is likely
• Uses much smaller register space
• Fewer operands
• No need for the sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Code
• Example: DAXPY
          LD     F0,a        ; load scalar a
          MOV    F1, F0      ; copy a into F1 for SIMD MUL
          MOV    F2, F0      ; copy a into F2 for SIMD MUL
          MOV    F3, F0      ; copy a into F3 for SIMD MUL
          DADDIU R4,Rx,#512  ; last address to load
Loop:     L.4D   F4,0[Rx]    ; load X[i], X[i+1], X[i+2], X[i+3]
          MUL.4D F4,F4,F0    ; a×X[i], a×X[i+1], a×X[i+2], a×X[i+3]
          L.4D   F8,0[Ry]    ; load Y[i], Y[i+1], Y[i+2], Y[i+3]
          ADD.4D F8,F8,F4    ; a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
          S.4D   0[Ry],F8    ; store into Y[i], Y[i+1], Y[i+2], Y[i+3]
          DADDIU Rx,Rx,#32   ; increment index to X
          DADDIU Ry,Ry,#32   ; increment index to Y
          DSUBU  R20,R4,Rx   ; compute bound
          BNEZ   R20,Loop    ; check if done
CA-Lec8 cwliutwinseenctuedutw 56
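For reference, a sketch of the scalar loop this code implements; the #512 bound is 64 doubles × 8 bytes, and each 4-wide SIMD instruction covers four consecutive iterations:

/* Scalar equivalent (a sketch): 64 elements, matching the #512-byte bound. */
void daxpy64(double a, const double *x, double *y) {
  for (int i = 0; i < 64; i++)
    y[i] = a * x[i] + y[i];
}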
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
– No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write the entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128‐256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write the whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers, rather than register depth, to hide latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Model
• Basic idea
– Plot peak floating‐point throughput as a function of arithmetic intensity
– Ties together floating‐point performance and memory performance for a target machine
• Arithmetic intensity
– Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
• Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating‐Point Performance)
CA-Lec8 cwliutwinseenctuedutw 59
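The roofline's defining min() is easy to mechanize; a sketch with made-up peak numbers (100 GFLOP/s and 25 GB/s are illustrative, not a real machine):

#include <stdio.h>

/* Sketch: attainable GFLOP/s is the lesser of the compute peak and what
   memory can feed at a given arithmetic intensity (AI, FLOPs per byte). */
double attainable_gflops(double peak_gflops, double peak_bw_gbs, double ai) {
  double memory_bound = peak_bw_gbs * ai;  /* GB/s × FLOPs/byte = GFLOP/s */
  return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

int main(void) {
  for (double ai = 0.25; ai <= 16.0; ai *= 2)  /* sweep the intensity axis */
    printf("AI %5.2f -> %6.2f GFLOP/s\n", ai, attainable_gflops(100.0, 25.0, ai));
  return 0;
}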
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally on high‐end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
• Basic idea
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C‐like programming language for the GPU
– Unify all forms of GPU parallelism as the CUDA thread
– Programming model is “Single Instruction, Multiple Thread”
CA-Lec8 cwliutwinseenctuedutw 61
Programming Model
• CUDA's design goals
– extend a standard sequential programming language, specifically C/C++,
• focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language
– a minimalist set of abstractions for expressing parallelism
• highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
• Similarities to vector machines
– Works well with data‐level parallel problems
– Scatter‐gather transfers from memory into local store
– Mask registers
– Large register files
• Differences
– No scalar processor; scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Programming the GPU
• CUDA Programming Model
– Single Instruction, Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
• GPU hardware handles thread management, not applications or the OS
– Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
CA-Lec8 cwliutwinseenctuedutw 64
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– A SIMD instruction executes 32 elements at a time
– Thus, grid size = 16 blocks
– A block is analogous to a strip‐mined vector loop with vector length of 32
– A block is assigned to a multithreaded SIMD processor by the thread block scheduler
– Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
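As a sanity check, the sizing arithmetic from this slide in code form (a sketch; the names are mine):

/* 8192 elements, 512 threads per block, SIMD width 32. */
void grid_sizing(void) {
  int n = 8192;
  int threads_per_block = 512;
  int nblocks = (n + threads_per_block - 1) / threads_per_block; /* = 16 blocks */
  int simd_width = 32;
  int simd_threads_per_block = threads_per_block / simd_width;   /* = 16 SIMD threads */
  (void)nblocks; (void)simd_threads_per_block;
}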
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses a scoreboard to dispatch
– No data dependencies between threads
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– A SIMD thread has up to:
• 64 vector registers of 32 32‐bit elements
• 32 vector registers of 32 64‐bit elements
– Fermi has 16 physical SIMD lanes, each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
GTX570 GPU (organization figure)
• Global memory: 1280 MB; shared L2 cache: 640 KB
• 15 SMs (SM 0 … SM 14), each with 32 cores and 32,768 registers
• Per SM: 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, 8 KB constant cache
• Up to 1536 threads/SM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
• 32 threads within a block work collectively
– Memory access optimization, latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication, Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general‐purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub‐matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (a tiled‐kernel sketch follows)
CA-Lec8 cwliutwinseenctuedutw 72
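A hedged sketch of the buffering idea just described: one thread per element of C, with TILE×TILE sub-matrices of A and B staged in shared memory (TILE, the names, and the float data type are illustrative assumptions, not the lecture's code; assumes n is a multiple of TILE):

#define TILE 16

/* Sketch: each thread computes one C element; tiles buffered in shared memory. */
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
  __shared__ float As[TILE][TILE], Bs[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;
  for (int t = 0; t < n / TILE; t++) {
    As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
    __syncthreads();                         /* tile fully staged */
    for (int k = 0; k < TILE; k++)
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    __syncthreads();                         /* done with this tile */
  }
  C[row * n + col] = acc;
}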
Programming the GPU
• Distinguishing the execution place of functions:
__device__ or __global__ => GPU device; variables declared are allocated to GPU memory
__host__ => system processor (HOST)
• Function call: Name<<<dimGrid, dimBlock>>>(parameter list)
– blockIdx: block identifier
– threadIdx: thread identifier within a block
– blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);

// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

// DAXPY in CUDA (a kernel launched from the host is __global__)
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a*x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Arch.
• “Parallel Thread Execution (PTX)”
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32       R8, blockIdx, 9   ; Thread Block ID * Block size (512 or 2^9)
add.s32       R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]       ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]       ; RD2 = Y[i]
mul.f64       RD0, RD0, RD4     ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64       RD0, RD0, RD2     ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0       ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• i.e., which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per‐thread‐lane 1‐bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw 76
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
setp.neq.s32  P1, RD0, #0  ; P1 is predicate register 1
@!P1, bra ELSE1, *Push     ; Push old mask, set new mask bits
                           ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
sub.f64  RD0, RD0, RD2     ; Difference in RD0
st.global.f64 [X+R8], RD0  ; X[i] = RD0
@P1, bra ENDIF1, *Comp     ; complement mask bits
                           ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0  ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off‐chip DRAM
– “Private memory”
– Contains stack frame, spilling registers, and private variables
• Each multithreaded SIMD processor also has local memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78
Fermi Architecture Innovations
• Each SIMD processor has:
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load‐store units, 4 special function units
– Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80
Compiler Technology for Loop‐Level Parallelism
• Loop‐carried dependence
– Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop‐level parallelism has no loop‐carried dependence
• Example 1:
for (i = 999; i >= 0; i = i - 1)
  x[i] = x[i] + s;
• No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw 81
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i = 0; i < 100; i = i + 1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 in the previous iteration
– Loop‐carried dependence
• S2 uses the value computed by S1 in the same iteration
– No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
• Intra‐loop dependence is not loop‐carried dependence
– A sequence of vector instructions that uses chaining exhibits exactly intra‐loop dependence
• Two types of S1‐S2 intra‐loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
– The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3:
for (i = 0; i < 100; i = i + 1) {
  A[i] = A[i] + B[i];    /* S1 */
  B[i+1] = C[i] + D[i];  /* S2 */
}
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to:
A[0] = A[0] + B[0];
for (i = 0; i < 99; i = i + 1) {
  B[i+1] = C[i] + D[i];      /* S2 */
  A[i+1] = A[i+1] + B[i+1];  /* S1 */
}
B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop‐carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i = 0; i < 100; i = i + 1) {
  A[i] = B[i] + C[i];
  D[i] = A[i] * E[i];
}
• The second reference to A need not be translated to a load instruction
– The two references are the same; there is no intervening memory access to the same location
• A more complex analysis (i.e., loop‐carried dependence analysis + data dependence analysis) can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
• Recurrence is a special form of loop‐carried dependence
• Recurrence example:
for (i = 1; i < 100; i = i + 1)
  Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
– Some vector computers have special support for executing recurrences
– It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• To determine which loops might contain parallelism (“inexact”) and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
– A one‐dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
– The index of a multi‐dimensional array is affine if the index in each dimension is affine
– Non‐affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding Dependencies: Example
• Assume:
– Load an array element with index c × i + d, and store to index a × i + b
– i runs from m to n
• A dependence exists if the following two conditions hold:
1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test:
• If a loop‐carried dependence exists, then GCD(c, a) must divide |d − b|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i = 0; i < 100; i = i + 1)
  X[2*i+3] = X[2*i] * 5.0;
• Solution:
1. a = 2, b = 3, c = 2, and d = 0
2. GCD(a, c) = 2, |b − d| = 3
3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
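The test is easy to mechanize; a sketch (the helper names are mine, not the lecture's), with the slide's example as usage:

/* Sketch of the GCD dependence test: store index a*i + b, load index c*i + d.
   Returns 1 if a loop-carried dependence is possible, i.e., GCD(c, a)
   divides |d - b|; sufficient but not necessary (see the next slide). */
static int gcd(int x, int y) {
  while (y != 0) { int t = x % y; x = y; y = t; }
  return x < 0 ? -x : x;
}

int gcd_test_may_depend(int a, int b, int c, int d) {
  int diff = d - b;
  if (diff < 0) diff = -diff;
  return diff % gcd(c, a) == 0;
}

/* Usage, for X[2*i+3] = X[2*i] * 5.0 above:
   gcd_test_may_depend(2, 3, 2, 0) returns 0, so no dependence is possible. */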
Remarks
• The GCD test is sufficient but not necessary
– The GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but no dependence exists (see the sketch below)
• Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
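A hedged concrete illustration of "sufficient but not necessary": the GCD test flags this loop, yet the loop bounds keep the stored and loaded ranges disjoint, so no dependence actually exists:

/* a = 1, b = 10, c = 1, d = 0: GCD(1, 1) = 1 divides |0 - 10|, so the GCD
   test reports a possible dependence. But with i in 0..9 the stores touch
   X[10..19] while the loads touch X[0..9]; the ranges never overlap. */
void no_real_dependence(double *X) {
  for (int i = 0; i < 10; i = i + 1)
    X[i + 10] = X[i] + 1.0;
}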
Finding Dependencies
• Example 2:
for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c;  /* S1 */
  X[i] = X[i] + c;  /* S2 */
  Z[i] = Y[i] + c;  /* S3 */
  Y[i] = c - Y[i];  /* S4 */
}
• True dependences: S1 -> S3 (Y[i]) and S1 -> S4 (Y[i]), but not loop‐carried
• Antidependences: S1 -> S2 (X[i]), S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i = 0; i < 100; i = i + 1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i = 0; i < 100; i = i + 1) {
  T[i] = X[i] / c;    /* Y renamed to T */
  X1[i] = X[i] + c;   /* X renamed to X1 */
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i = 9999; i >= 0; i = i - 1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop‐carried dependence
• Transform to…
for (i = 9999; i >= 0; i = i - 1)
  sum[i] = x[i] * y[i];
for (i = 9999; i >= 0; i = i - 1)
  finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion: scalar ===> vector; parallel.
The second loop is called a reduction: it sums up the elements of the vector; not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
– Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i = 999; i >= 0; i = i - 1)
  finalsum[p] = finalsum[p] + sum[i + 1000*p];
– Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
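The ten per-processor partial sums still have to be combined; a sketch of that final step by recursive halving (log2 p steps, each step's adds independent), assuming nproc is padded to a power of two with zero-initialized slots:

/* Sketch: combine per-processor partial sums by recursive halving. */
double combine(double *finalsum, int nproc) {
  for (int stride = nproc / 2; stride >= 1; stride /= 2)
    for (int p = 0; p < stride; p++)   /* each step can run in parallel */
      finalsum[p] += finalsum[p + stride];
  return finalsum[0];
}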
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is the next step to performance
• Coarse‐grained vs. fine‐grained multithreading
– Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine‐grained multithreading based on an OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy efficient, and a better real‐time model than OOO machines
– Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Scatter‐Gatherbull Consider
for (i = 0 i lt n i=i+1)A[K[i]] = A[K[i]] + C[M[i]]
bull Ra Rc Rk and Rm contain the starting addresses of the vectors
bull Use index vectorLV Vk Rk load KLVI Va (Ra+Vk) load A[K[]]LV Vm Rm load MLVI Vc (Rc+Vm) load C[M[]]ADDVVD Va Va Vc add themSVI (Ra+Vk) Va store A[K[]]
bull A and C must have the same number of non‐zero elements (sizes of k and m)
CA-Lec8 cwliutwinseenctuedutw 47
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Vector Architecture Summarybull Vector is alternative model for exploiting ILP
ndash If code is vectorizable then simpler hardware energy efficient and better real‐time model than out‐of‐order
bull More lanes slower clock rate
ndash Scalable if elements are independentndash If there is dependency
bull One stall per vector instruction rather than one stall per vector element
bull Programmer in charge of giving hints to the compilerbull Design issues number of lanes functional units and registers length of
vector registers exception handling conditional operationsbull Fundamental design issue is memory bandwidth
ndash Especially with virtual address translation and caching
CA-Lec8 cwliutwinseenctuedutw 48
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
• Goals: determine which loops might contain parallelism (the analysis is "inexact") and eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b, where i is the loop index
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependencies: Example
• Assume
  – A load of an array element with index c*i + d, and a store to index a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k, both within the loop bounds: m ≤ j ≤ n, m ≤ k ≤ n
  2. The store and the load touch the same element: a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
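A minimal sketch of the GCD test as code (illustrative; the helper names gcd and gcd_test are mine, not from any slide or library). It checks only the necessary divisibility condition, so a result of 1 means a dependence may exist, while 0 proves independence:

#include <stdlib.h>

/* Greatest common divisor by Euclid's algorithm */
static int gcd(int a, int b) {
  while (b != 0) { int t = a % b; a = b; b = t; }
  return a;
}

/* GCD test for a store to X[a*i+b] and a load of X[c*i+d].
   Returns 0 if no dependence is possible, 1 if one may exist. */
int gcd_test(int a, int b, int c, int d) {
  int g = gcd(abs(a), abs(c));
  if (g == 0)               /* both coefficients zero: constant indices */
    return b == d;          /* collide only if b == d */
  return (d - b) % g == 0;  /* a dependence requires g to divide (d - b) */
}

For the example that follows (a=2, b=3, c=2, d=0), gcd_test(2, 3, 2, 0) finds that 2 does not divide (0 - 3) and returns 0: no dependence is possible.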
Example
    for (i=0; i<100; i=i+1)
      X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |b - d| = 3
  3. Since 2 does not divide 3, no dependence is possible
Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but it is not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (a dependence looks possible) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependencies
• Example 2:
    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;  /* S1 */
      X[i] = X[i] + c;  /* S2 */
      Z[i] = Y[i] + c;  /* S3 */
      Y[i] = c - Y[i];  /* S4 */
    }
• True dependences: S1->S3 and S1->S4 (because of Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
    }
• After:
    for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;   /* Y renamed to T */
      X1[i] = X[i] + c;  /* X renamed to X1 */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
    }
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
    for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to:
    for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
    for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
• The first transformed loop is called scalar expansion (scalar ===> vector) and is parallel
• The second loop is called a reduction: it sums up the elements of the vector and is not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum 1,000 elements on each of ten processors (see the sketch below):
    for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
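Putting the two phases together, a hedged end-to-end sketch (my arrangement; in reality each value of p would run on its own processor rather than in a serial outer loop):

/* Two-level reduction over 10,000 products, as in the slide.
   Phase 1: ten partial sums, one per (logical) processor p.
   Phase 2: combine the ten partials into a single total. */
double reduce10000(const double *sum /* the 10,000 products */) {
  double finalsum[10];
  for (int p = 0; p <= 9; p++) {       /* each p could run on its own processor */
    finalsum[p] = 0.0;
    for (int i = 999; i >= 0; i--)
      finalsum[p] = finalsum[p] + sum[i + 1000 * p];
  }
  double total = 0.0;                  /* a log-depth tree is also possible; */
  for (int p = 0; p <= 9; p++)         /* a simple serial combine is shown here */
    total += finalsum[p];
  return total;
}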
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on top of an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, the result is simpler hardware that is more energy efficient and has a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Programming Vector Architecturesbull Compilers can provide feedback to programmersbull Programmers can provide hints to compilerbull Cray Y‐MP Benchmarks
CA-Lec8 cwliutwinseenctuedutw 49
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Multimedia Extensionsbull Very short vectors added to existing ISAsbull Usually 64‐bit registers split into 2x32b or 4x16b or 8x8bbull Newer designs have 128‐bit registers (Altivec SSE2)bull Limited instruction set
ndash no vector length controlndash no loadstore stride or scattergatherndash unit‐stride loads must be aligned to 64128‐bit boundary
bull Limited vector register lengthndash requires superscalar dispatch to keep multiplyaddload units
busyndash loop unrolling to hide latencies increases register pressure
bull Trend towards fuller vector support in microprocessors
CA-Lec8 cwliutwinseenctuedutw 50
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
ldquoVectorrdquo for Multimediabull Intel MMX 57 additional 80x86 instructions (1st since 386)
ndash similar to Intel 860 Mot 88110 HP PA‐71000LC UltraSPARCbull 3 data types 8 8‐bit 4 16‐bit 2 32‐bit in 64bits
ndash reuse 8 FP registers (FP and MMX cannot mix)bull short vector load add store 8 8‐bit operands
bull Claim overall speedup 15 to 2X for 2D3D graphics audio video speech comm ndash use in drivers or added to library routines no compiler
+
CA-Lec8 cwliutwinseenctuedutw 51
MMX Instructionsbull Move 32b 64bbull Add Subtract in parallel 8 8b 4 16b 2 32b
ndash opt signedunsigned saturate (set to max) if overflowbull Shifts (sllsrl sra) And And Not Or Xorin parallel 8 8b 4 16b 2 32b
bull Multiply Multiply‐Add in parallel 4 16bbull Compare = gt in parallel 8 8b 4 16b 2 32b
ndash sets field to 0s (false) or 1s (true) removes branchesbull PackUnpack
ndash Convert 32bltndashgt 16b 16b ltndashgt 8bndash Pack saturates (set to max) if number is too large
CA-Lec8 cwliutwinseenctuedutw 52
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovations
• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
Fermi Multithreaded SIMD Proc. (figure not reproduced)
Compiler Technology for Loop-Level Parallelism
• Loop-carried dependence
  – Analysis focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism requires no loop-carried dependence
• Example 1:
  for (i=999; i>=0; i=i-1)
    x[i] = x[i] + s;
• No loop-carried dependence
Example 2 for Loop-Level Parallelism
• Example 2:
  for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
  }
• S1 and S2 each use a value computed by themselves in the previous iteration (A[i] and B[i])
  – Loop-carried dependence
• S2 also uses the value A[i+1] computed by S1 in the same iteration
  – Not a loop-carried dependence
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:
  for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];  /* S2 */
  }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:
  A[0] = A[0] + B[0];
  for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];      /* S2 */
    A[i+1] = A[i+1] + B[i+1];  /* S1 */
  }
  B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism
  for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
  }
• The second reference to A need not be translated to a load instruction (see the sketch below)
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied in the same basic block to optimize
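A minimal sketch of the resulting code (illustrative; t stands for the register holding A[i]):

  for (i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];   // compute once, keep in a register
    A[i] = t;                 // one store to A[i]
    D[i] = t * E[i];          // reuse the register; no reload of A[i]
  }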
Recurrence
• Recurrence is a special form of loop-carried dependence.
• Recurrence example:
  for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP (see the sketch below)
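One hedged illustration (not from the lecture): if the additions may be reassociated (exact for integers, a rounding change for floating point), pairwise partial sums shorten the serial chain to one add per two elements:

  int i;
  for (i = 1; i < 99; i = i + 2) {
    int pair = Y[i] + Y[i+1];   // partial sums are independent across iterations
    Y[i]   = Y[i-1] + Y[i];
    Y[i+1] = Y[i-1] + pair;     // only one chained add per two elements
  }
  Y[99] = Y[98] + Y[99];        // tail iteration (odd trip count)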
Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a*i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependences: Example
• Assume:
  – Load an array element with index c*i + d, and store to an element with index a*i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a*j + b = c*k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b), i.e., GCD(c, a) | |d - b|
Example
  for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2, |b-d| = 3
  3. Since 2 does not divide 3, no dependence is possible
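As an illustrative aside (not from the lecture), the test is mechanical enough to code directly; gcd and gcd_test_may_depend are hypothetical helper names:

  #include <stdio.h>

  static int gcd(int a, int b)                 // Euclid's algorithm
  {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
  }

  // 1 if the GCD test cannot rule out a dependence between a store to
  // index a*i+b and a load from index c*i+d, 0 if no dependence is possible.
  static int gcd_test_may_depend(int a, int b, int c, int d)
  {
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    int diff = d - b;
    if (diff < 0) diff = -diff;
    if (g == 0) return diff == 0;              // both strides zero: constant indices
    return diff % g == 0;
  }

  int main(void)
  {
    // X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0
    printf("%s\n", gcd_test_may_depend(2, 3, 2, 0)
                     ? "dependence possible" : "no dependence possible");
    return 0;
  }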
Remarks
• The GCD test is sufficient but not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds (a dependence is not ruled out) but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences: Example 2
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;  /* S1 */
    X[i] = X[i] + c;  /* S2 */
    Z[i] = Y[i] + c;  /* S3 */
    Y[i] = c - Y[i];  /* S4 */
  }
• True dependences: S1->S3 and S1->S4 (both on Y[i]), but not loop-carried
• Antidependences: S1->S2 (X[i]), S3->S4 (Y[i])
• Output dependence: S1->S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
  }
Eliminating Recurrence Dependence
• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence.
• Transform to:
  for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];
  /* This is called scalar expansion: scalar ==> vector; parallel */
  for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];
  /* This is called a reduction: sums up the elements of the vector; not parallel */
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (see the sketch below):
  for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
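A hedged sketch of the complete computation (illustrative names; assumes sum[10000] holds the products from the scalar-expansion step):

  double finalsum[10] = {0.0};
  for (int p = 0; p < 10; p++)      // each p can run on its own processor
    for (int i = 999; i >= 0; i--)
      finalsum[p] = finalsum[p] + sum[i + 1000*p];

  double total = 0.0;
  for (int p = 0; p < 10; p++)      // short serial combine of the ten partial sums
    total = total + finalsum[p];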
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stalls vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
SIMD Implementations IA32AMD64
bull Intel MMX (1996)ndash Repurpose 64‐bit floating point registersndash Eight 8‐bit integer ops or four 16‐bit integer ops
bull Streaming SIMD Extensions (SSE) (1999)ndash Separate 128‐bit registersndash Eight 16‐bit integer ops Four 32‐bit integerfp ops or two 64‐bit
integerfp opsndash Single‐precision floating‐point arithmetic
bull SSE2 (2001) SSE3 (2004) SSE4(2007)ndash Double‐precision floating‐point arithmetic
bull Advanced Vector Extensions (2010)ndash 256‐bits registersndash Four 64‐bit integerfp opsndash Extensible to 512 and 1024 bits for future generations
CA-Lec8 cwliutwinseenctuedutw 53
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
SIMD Implementations IBMbull VMX (1996‐1998)
ndash 32 4b 16 8b 8 16b 4 32b integer ops and 4 32b FP opsndash Data rearrangement
bull Cell SPE (PS3)ndash 16 8b 8 16b 4 32b integer ops and 4 32b and 8 64b FP opsndash Unified vectorscalar execution with 128 registers
bull VMX 128 (Xbox360)ndash Extension to 128 registers
bull VSX (2009)ndash 1 or 2 64b FPU 4 32b FPUndash Integrate FPU and VMX into unit with 64 registers
bull QPX (2010 Blue Gene)ndash Four 64b SP or DP FP
CA-Lec8 cwliutwinseenctuedutw 54
Why SIMD Extensions
bull Media applications operate on data types narrower than the native word size
bull Costs little to add to the standard arithmetic unitbull Easy to implementbull Need smaller memory bandwidth than vectorbull Separate data transfer aligned in memory
ndash Vector single instruction 64 memory accesses page fault in the middle of the vector likely
bull Use much smaller register spacebull Fewer operandsbull No need for sophisticated mechanisms of vector architecture
CA-Lec8 cwliutwinseenctuedutw 55
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
Example 3 for Loop-Level Parallelism (1)
• Example 3:

for (i=0; i<100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2; interchanging the two statements will not affect the execution of S2
Example 3 for Loop-Level Parallelism (2)
• Transform to:

A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
    B[i+1] = C[i] + D[i];       /* S2 */
    A[i+1] = A[i+1] + B[i+1];   /* S1 */
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried
More on Loop-Level Parallelism

for (i=0; i<100; i=i+1) {
    A[i] = B[i] + C[i];
    D[i] = A[i] * E[i];
}

• The second reference to A need not be translated to a load instruction
  – The two references are the same, and there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize (see the sketch below)
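A sketch of the code after the redundant load is removed, assuming plain C with double arrays; the temporary t is illustrative of the register the compiler would use.

for (i = 0; i < 100; i = i + 1) {
    double t = B[i] + C[i];   /* value of A[i], kept in a register */
    A[i] = t;                 /* store once */
    D[i] = t * E[i];          /* reuse the register: no reload of A[i] */
}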
Recurrence
• A recurrence is a special form of loop-carried dependence
• Recurrence example:

for (i=1; i<100; i=i+1)
    Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism (the analysis is "inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
Finding Dependences: Example
• Assume:
  – The loop loads an array element with index c × i + d and stores to an element with index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)
Example

for (i=0; i<100; i=i+1)
    X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a = 2, b = 3, c = 2, and d = 0
  2. GCD(a, c) = 2, |d − b| = 3
  3. Since 2 does not divide 3, no dependence is possible
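A small C sketch of the GCD test applied to this example; the function names are illustrative. Note that it checks only the divisibility condition, so a result of 1 means a dependence is merely possible, as the remarks below explain.

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

/* Store index a*i + b, load index c*i + d: a loop-carried dependence
   is possible only if GCD(c, a) divides (d - b). */
static int dependence_possible(int a, int b, int c, int d)
{
    return abs(d - b) % gcd(c, a) == 0;
}

int main(void)
{
    /* X[2*i+3] = X[2*i] * 5.0  =>  a = 2, b = 3, c = 2, d = 0 */
    printf("%d\n", dependence_possible(2, 3, 2, 0));  /* prints 0: no dependence possible */
    return 0;
}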
Remarks
• The GCD test is conservative: a failed test guarantees that no dependence exists, but a successful test does not prove one
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
Finding Dependences
• Example 2:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;   /* S1 */
    X[i] = X[i] + c;   /* S2 */
    Z[i] = Y[i] + c;   /* S3 */
    Y[i] = c - Y[i];   /* S4 */
}

• True dependences: S1 -> S3 and S1 -> S4 (both on Y[i]), but they are not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:

for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c - Y[i];
}

• After:

for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}
Eliminating Recurrence Dependence
• Recurrence example: a dot product

for (i=9999; i>=0; i=i-1)
    sum = sum + x[i] * y[i];

• The loop is not parallel because it has a loop-carried dependence
• Transform to…

for (i=9999; i>=0; i=i-1)
    sum[i] = x[i] * y[i];

for (i=9999; i>=0; i=i-1)
    finalsum = finalsum + sum[i];

• The first loop is called scalar expansion (scalar ===> vector); it is parallel
• The second loop is called a reduction; it sums up the elements of the vector and is not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors (a full sketch follows below)

for (i=999; i>=0; i=i-1)
    finalsum[p] = finalsum[p] + sum[i + 1000*p];

  – Assume p ranges from 0 to 9
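To make the whole scheme concrete, a C sketch under the slide's assumptions (10,000 partial products in sum[], ten processors); the function name and the serial final combine are illustrative.

/* Level 1 is the per-processor loop above (one instance per p);
   level 2 combines the ten partial sums. */
void reduce_two_level(const double sum[10000], double finalsum[10], double *total)
{
    for (int p = 0; p < 10; p = p + 1) {        /* in practice, one processor per p */
        finalsum[p] = 0.0;
        for (int i = 999; i >= 0; i = i - 1)
            finalsum[p] = finalsum[p] + sum[i + 1000 * p];
    }
    *total = 0.0;                               /* final combine; serial here, or a log-depth tree */
    for (int p = 0; p < 10; p = p + 1)
        *total = *total + finalsum[p];
}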
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, it reuses the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, the hardware is simpler, more energy efficient, and offers a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Example SIMD Codebull Example DXPY
LD F0a load scalar aMOV F1 F0 copy a into F1 for SIMD MULMOV F2 F0 copy a into F2 for SIMD MULMOV F3 F0 copy a into F3 for SIMD MULDADDIU R4Rx512 last address to load
Loop L4D F40[Rx] load X[i] X[i+1] X[i+2] X[i+3]MUL4D F4F4F0 atimesX[i]atimesX[i+1]atimesX[i+2]atimesX[i+3]L4D F80[Ry] load Y[i] Y[i+1] Y[i+2] Y[i+3]ADD4D F8F8F4 atimesX[i]+Y[i] atimesX[i+3]+Y[i+3]S4D 0[Ry]F8 store into Y[i] Y[i+1] Y[i+2] Y[i+3]DADDIU RxRx32 increment index to XDADDIU RyRy32 increment index to YDSUBU R20R4Rx compute boundBNEZ R20Loop check if done
CA-Lec8 cwliutwinseenctuedutw 56
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Challenges of SIMD Architecturesbull Scalar processor memory architecture
ndash Only access to contiguous databull No efficient scattergather accesses
ndash Significant penalty for unaligned memory accessndash May need to write entire vector register
bull Limitations on data access patternsndash Limited by cache line up to 128‐256b
bull Conditional executionndash Register renaming does not work well with masked executionndash Always need to write whole registerndash Difficult to know when to indicate exceptions
bull Register pressurendash Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 cwliutwinseenctuedutw 57
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Roofline Performance Modelbull Basic idea
ndash Plot peak floating‐point throughput as a function of arithmetic intensity
ndash Ties together floating‐point performance and memory performance for a target machine
bull Arithmetic intensityndash Floating‐point operations per byte read
CA-Lec8 cwliutwinseenctuedutw 58
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units

• Basic idea:
  – Heterogeneous execution model:
    • The CPU is the host, the GPU is the device.
  – Develop a C-like programming language for the GPU.
  – Unify all forms of GPU parallelism as the CUDA thread.
  – Programming model is "Single Instruction, Multiple Thread" (SIMT).
Programming Model

• CUDA's design goals:
  – Extend a standard sequential programming language, specifically C/C++:
    • Focus on the important issues of parallelism (how to craft efficient parallel algorithms) rather than grappling with the mechanics of an unfamiliar and complicated language.
  – A minimalist set of abstractions for expressing parallelism:
    • Highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores.
NVIDIA GPU Architecture

• Similarities to vector machines:
  – Works well with data-level parallel problems.
  – Scatter-gather transfers from memory into local store.
  – Mask registers.
  – Large register files.
• Differences:
  – No scalar processor (scalar code runs on the host).
  – Uses multithreading to hide memory latency.
  – Has many functional units, as opposed to a few deeply pipelined units like a vector processor.
Programming the GPU

• CUDA Programming Model:
  – Single Instruction, Multiple Thread (SIMT):
    • A thread is associated with each data element.
    • Threads are organized into blocks.
    • Blocks are organized into a grid.
• GPU hardware handles thread management, not applications or the OS.
  – Given the hardware invested to do graphics well, how can we supplement it to improve the performance of a wider range of applications?
Example

• Multiply two vectors of length 8192 (see the CUDA sketch below):
  – Code that works over all elements is the grid.
  – Thread blocks break this down into manageable sizes:
    • 512 threads per block.
  – A SIMD instruction executes 32 elements at a time.
  – Thus grid size = 8192 / 512 = 16 blocks.
  – A block is analogous to a strip-mined vector loop with vector length of 32.
  – A block is assigned to a multithreaded SIMD processor by the thread block scheduler.
  – Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors.
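A minimal CUDA sketch of this example (our illustration; the kernel name vmul and the pointer parameters are assumptions, while the 16 × 512 launch geometry is the slide's):

    // one CUDA thread per element; 16 blocks x 512 threads = 8192 threads,
    // exactly covering the vectors, so no bounds check is needed
    __global__ void vmul(const double *a, const double *b, double *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 8191
        c[i] = a[i] * b[i];
    }

    // host-side launch: the grid of 16 blocks covers all 8192 elements
    vmul<<<16, 512>>>(a, b, c);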
Terminology

• Threads of SIMD instructions:
  – Each has its own PC.
  – The thread scheduler uses a scoreboard to dispatch.
  – No data dependencies between threads.
  – Keeps track of up to 48 threads of SIMD instructions:
    • Hides memory latency.
• The thread block scheduler schedules blocks to SIMD processors.
• Within each SIMD processor:
  – 32 SIMD lanes.
  – Wide and shallow compared to vector processors.
Example

• An NVIDIA GPU has 32,768 registers:
  – Divided into lanes.
  – Each SIMD thread is limited to 64 registers.
  – A SIMD thread has up to:
    • 64 vector registers of 32 32-bit elements, or
    • 32 vector registers of 32 64-bit elements.
  – Fermi has 16 physical SIMD lanes, each containing 2,048 registers.
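As a sanity check on these numbers (our arithmetic): 16 lanes × 2,048 registers per lane = 32,768 registers per SIMD processor, and one SIMD thread's allotment of 64 vector registers × 32 elements each = 2,048 32-bit values held on that thread's behalf.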
GTX570 GPU

(recovered from the block diagram on the original slide)

• Chip-level resources: 1280 MB global memory, 640 KB L2 cache.
• Per-SM resources (SM 0 through SM 14, i.e., 15 SMs): 32 cores, 32,768 registers, 48 KB shared memory, 16 KB L1 cache, 8 KB texture cache, 8 KB constant cache, up to 1536 threads per SM.
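These per-SM limits explain the thread count used in the matrix-multiplication example later (our arithmetic): 15 SMs × 1,536 threads/SM = 23,040 concurrently active threads on a GTX570.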
GPU Threads in SM (GTX570)

• 32 threads within a block work collectively: memory access optimization, latency hiding.
Matrix Multiplication

• For a 4096×4096 matrix multiplication, matrix C requires calculation of 4096 × 4096 = 16,777,216 cells.
• On the GPU, each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means we can have that many matrix cells calculated in parallel.
• On a general-purpose processor we can only calculate one cell at a time.
• Each thread exploits the GPU's fine granularity by computing one element of matrix C.
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth (see the tiled-kernel sketch below).
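A minimal sketch of such a tiled kernel (our illustration, not the course's code; the name matmul_tiled, the tile size, row-major float matrices, and N divisible by TILE are all assumptions):

    #define TILE 16
    // C = A * B for N x N row-major matrices; launch with
    // dim3 grid(N/TILE, N/TILE), block(TILE, TILE)
    __global__ void matmul_tiled(const float *A, const float *B,
                                 float *C, int N) {
        __shared__ float As[TILE][TILE];   // sub-matrix buffers in shared memory
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < N / TILE; ++t) {     // walk tiles along the k dimension
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                     // tile fully loaded before use
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                     // done with tile before overwrite
        }
        C[row * N + col] = acc;                  // one thread per element of C
    }

Each element of A and B is loaded from global memory once per tile instead of once per multiply, which is exactly the bandwidth-buffering effect the slide describes.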
Programming the GPU

• Distinguishing the execution place of functions:
  – __device__ or __global__ ⇒ GPU (device); variables declared in them are allocated to GPU memory.
  – __host__ ⇒ system processor (host).
• Function (kernel) call: Name<<<dimGrid, dimBlock>>>(parameter list)
  – blockIdx: block identifier within the grid.
  – threadIdx: thread identifier within its block.
  – blockDim: number of threads per block.
CUDA Program Example

    // Invoke DAXPY
    daxpy(n, 2.0, x, y);
    // DAXPY in C
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

    // Invoke DAXPY with 256 threads per Thread Block
    __host__
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
    // DAXPY in CUDA
    __device__
    void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
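Note: the listing above follows the textbook's notation; in current CUDA, a kernel launched with the <<<...>>> syntax must be declared __global__ rather than __device__ (__device__ marks functions callable only from GPU code).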
NVIDIA Instruction Set Arch.

• "Parallel Thread Execution (PTX)".
• Uses virtual registers.
• Translation to machine code is performed in software.
• Example (the DAXPY loop body):

    shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512, or 2^9)
    add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
    ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
    ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
    mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
    add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
    st.global.f64 [Y+R8], RD0        ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching

• Like vector architectures, GPU branch hardware uses internal masks.
• Also uses:
  – A branch synchronization stack:
    • Entries consist of masks for each SIMD lane.
    • I.e., which threads commit their results (all threads execute).
  – Instruction markers to manage when a branch diverges into multiple execution paths:
    • Push, on a divergent branch.
  – ...and when paths converge:
    • Act as barriers.
    • Pops the stack.
• Per-thread-lane 1-bit predicate register, specified by the programmer.
Example

    if (X[i] != 0)
        X[i] = X[i] - Y[i];
    else
        X[i] = Z[i];

    ld.global.f64 RD0, [X+R8]          ; RD0 = X[i]
    setp.neq.s32  P1, RD0, #0          ; P1 is predicate register 1
    @!P1, bra     ELSE1, *Push         ; Push old mask, set new mask bits
                                       ; if P1 false, go to ELSE1
    ld.global.f64 RD2, [Y+R8]          ; RD2 = Y[i]
    sub.f64       RD0, RD0, RD2        ; Difference in RD0
    st.global.f64 [X+R8], RD0          ; X[i] = RD0
    @P1, bra      ENDIF1, *Comp        ; complement mask bits
                                       ; if P1 true, go to ENDIF1
    ELSE1: ld.global.f64 RD0, [Z+R8]   ; RD0 = Z[i]
           st.global.f64 [X+R8], RD0   ; X[i] = RD0
    ENDIF1: <next instruction>, *Pop   ; pop to restore old mask
NVIDIA GPU Memory Structures

• Each SIMD lane has a private section of off-chip DRAM:
  – "Private memory".
  – Contains the stack frame, spilled registers, and private variables.
• Each multithreaded SIMD processor also has local memory:
  – Shared by the SIMD lanes / threads within a block.
• Memory shared by all SIMD processors is GPU memory:
  – The host can read and write GPU memory.
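In CUDA terms the three levels look roughly like this (a sketch of ours; the kernel and variable names are made up, and it assumes a single block of exactly 256 threads):

    __global__ void scale_and_reverse(float *gmem) { // gmem: GPU (global) memory,
                                                     // visible to all SMs and the host
        __shared__ float buf[256];          // local memory: shared within one block
        float x = gmem[threadIdx.x];        // x: private to this thread (registers,
                                            // spilling to private memory)
        buf[threadIdx.x] = 2.0f * x;
        __syncthreads();                    // make the block's writes to buf visible
        gmem[threadIdx.x] = buf[255 - threadIdx.x];
    }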
Fermi Architecture Innovations

• Each SIMD processor has:
  – Two SIMD thread schedulers, two instruction dispatch units.
  – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units.
  – Thus, two threads of SIMD instructions are scheduled every two clock cycles.
• Fast double precision.
• Caches for GPU memory.
• 64-bit addressing and unified address space.
• Error correcting codes.
• Faster context switching.
• Faster atomic instructions.
Fermi Multithreaded SIMD Proc.

(block diagram in the original slides)
Compiler Technology for Loop-Level Parallelism

• Loop-carried dependence:
  – The analysis focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations.
• Loop-level parallelism requires that there be no loop-carried dependence.
• Example 1:

    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;

• No loop-carried dependence.
Example 2 for Loop-Level Parallelism

• Example 2:

    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
    }

• S1 uses the value A[i] computed by S1 in the previous iteration, and S2 uses the value B[i] computed by S2 in the previous iteration:
  – Loop-carried dependences.
• S2 uses the value A[i+1] computed by S1 in the same iteration:
  – Not a loop-carried dependence.
Remarks

• Intra-loop dependence is not loop-carried dependence:
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence.
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1.
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1.
• A loop is parallel if it can be written without a cycle in the dependences:
  – The absence of a cycle means that the dependences give a partial ordering on the statements.
Example 3 for Loop-Level Parallelism (1)

• Example 3:

    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];   /* S2 */
    }

• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.
Example 3 for Loop-Level Parallelism (2)

• Transform to:

    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];       /* S2 */
        A[i+1] = A[i+1] + B[i+1];   /* S1 */
    }
    B[100] = C[99] + D[99];

• The dependence between the two statements is no longer loop-carried.
More on Loop-Level Parallelism

    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }

• The second reference to A need not be translated to a load instruction:
  – The two references are the same, and there is no intervening memory access to the same location.
• A more complex analysis, i.e., loop-carried dependence analysis plus data dependence analysis, can be applied within the same basic block to optimize.
Recurrence

• A recurrence is a special form of loop-carried dependence.
• Recurrence example:

    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];

• Detecting a recurrence is important:
  – Some vector computers have special support for executing recurrences.
  – It may still be possible to exploit a fair amount of ILP (see the sketch below).
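One standard way to recover ILP from a reduction-style recurrence is to split the accumulator; a sketch of ours for the dot-product recurrence (sum0/sum1 are made-up names, and n is assumed even):

    /* the two partial sums are independent within each iteration,
       exposing 2-way ILP; note this reassociates FP addition */
    sum0 = 0; sum1 = 0;
    for (i = 0; i < n; i = i + 2) {
        sum0 = sum0 + x[i]   * y[i];
        sum1 = sum1 + x[i+1] * y[i+1];
    }
    sum = sum0 + sum1;   /* combine at the end */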
Compiler Technology for Finding Dependences

• Goals: determine which loops might contain parallelism ("inexact" analysis) and eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
  – A one-dimensional array index is affine if it can be written in the form a*i + b (where i is the loop index and a, b are constants).
  – The index of a multi-dimensional array is affine if the index in each dimension is affine.
  – Non-affine access example: x[ y[i] ].
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop.
Finding Dependencies: Example

• Assume:
  – A load of an array element with index c*i + d, and a store to index a*i + b.
  – i runs from m to n.
• A dependence exists if the following two conditions hold:
  1. There are two iteration indices j and k, both within the loop bounds: m ≤ j ≤ n and m ≤ k ≤ n.
  2. The store and the load touch the same element: a*j + b = c*k + d.
• In general, the values of a, b, c, and d are not known at compile time:
  – Dependence testing is expensive but decidable.
  – GCD (greatest common divisor) test (a small C sketch follows):
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b).
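The test itself is only a few lines; a minimal sketch in C (ours; the function names are made up, and we guard the a = c = 0 corner case):

    #include <stdlib.h>   /* abs */

    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    /* 1 = dependence possible (the test cannot rule it out),
       0 = no dependence */
    int gcd_test(int a, int b, int c, int d) {
        int g = gcd(abs(a), abs(c));
        if (g == 0)              /* both coefficients zero: */
            return b == d;       /* same fixed element?     */
        return (d - b) % g == 0;
    }

For the example on the next slide, gcd_test(2, 3, 2, 0) returns 0: no dependence.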
Example

    for (i=0; i<100; i=i+1)
        X[2*i+3] = X[2*i] * 5.0;

• Solution:
  1. a = 2, b = 3, c = 2, and d = 0.
  2. GCD(a, c) = 2, and |d − b| = 3.
  3. Since 2 does not divide 3, no dependence is possible.
• Intuitively: the stores touch odd indices (2i+3) while the loads touch even indices (2i), so they can never conflict.
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Examples
bull Attainable GFLOPssec
CA-Lec8 cwliutwinseenctuedutw 59
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
History of GPUs
bull Early video cardsndash Frame buffer memory with address generation for video output
bull 3D graphics processingndash Originally high‐end computers (eg SGI)ndash 3D graphics cards for PCs and game consoles
bull Graphics Processing Unitsndash Processors oriented to 3D graphics tasksndash Vertexpixel processing shading texture mapping ray tracing
CA-Lec8 cwliutwinseenctuedutw 60
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Graphical Processing Units
bull Basic ideandash Heterogeneous execution model
bull CPU is the host GPU is the device
ndash Develop a C‐like programming language for GPUndash Unify all forms of GPU parallelism as CUDA threadndash Programming model is ldquoSingle Instruction Multiple Threadrdquo
CA-Lec8 cwliutwinseenctuedutw 61
Graphical P
rocessing Units
Programming Model
bull CUDArsquos design goalsndash extend a standard sequential programming language specifically CC++
bull focus on the important issues of parallelismmdashhow to craft efficient parallel algorithmsmdashrather than grappling with the mechanics of an unfamiliar and complicated language
ndash minimalist set of abstractions for expressing parallelismbull highly scalable parallel code that can run across tens of thousands of concurrent threads and hundreds of processor cores
CA-Lec8 cwliutwinseenctuedutw 62
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
Example
  for (i=0; i<100; i=i+1) {
      X[2*i+3] = X[2*i] * 5.0;
  }
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2 and |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible
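The test is mechanical enough to sketch in a few lines of C (illustrative helper names, assuming non-degenerate strides so that GCD(a, c) > 0); run on the example above, it correctly reports that no dependence is possible.

  #include <stdio.h>
  #include <stdlib.h>

  /* Greatest common divisor via Euclid's algorithm. */
  static int gcd(int x, int y) {
      while (y != 0) { int t = x % y; x = y; y = t; }
      return x;
  }

  /* GCD test for a store to a*i + b and a load from c*i + d:
     returns 1 if a loop-carried dependence is possible, 0 if ruled out. */
  static int gcd_test(int a, int b, int c, int d) {
      return abs(d - b) % gcd(a, c) == 0;
  }

  int main(void) {
      /* X[2*i+3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0 */
      printf("dependence possible? %d\n", gcd_test(2, 3, 2, 0));  /* prints 0 */
      return 0;
  }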
Remarks
• The GCD test is sufficient to guarantee that no dependence exists, but it is not necessary
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test reports a possible dependence but no dependence actually exists
• Determining whether a dependence actually exists is NP-complete
Finding dependencies: Example 2
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;   /* S1 */
      X[i] = X[i] + c;   /* S2 */
      Z[i] = Y[i] + c;   /* S3 */
      Y[i] = c - Y[i];   /* S4 */
  }
• True dependences: S1 -> S3 and S1 -> S4 (because of Y[i]), but not loop-carried
• Antidependences: S1 -> S2 (X[i]) and S3 -> S4 (Y[i])
• Output dependence: S1 -> S4 (Y[i])
Renaming to Eliminate False (Pseudo) Dependences
• Before:
  for (i=0; i<100; i=i+1) {
      Y[i] = X[i] / c;
      X[i] = X[i] + c;
      Z[i] = Y[i] + c;
      Y[i] = c - Y[i];
  }
• After:
  for (i=0; i<100; i=i+1) {
      T[i] = X[i] / c;    /* Y renamed to T: removes the output and antidependences on Y */
      X1[i] = X[i] + c;   /* X renamed to X1: removes the antidependence on X */
      Z[i] = T[i] + c;
      Y[i] = c - T[i];
  }
Eliminating Recurrence Dependence
• Recurrence example: a dot product
  for (i=9999; i>=0; i=i-1)
      sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence on sum
• Transform to:
  for (i=9999; i>=0; i=i-1)
      sum[i] = x[i] * y[i];
  for (i=9999; i>=0; i=i-1)
      finalsum = finalsum + sum[i];
• The first loop is called scalar expansion: the scalar sum becomes a vector sum[], and the loop is parallel/vectorizable
• The second loop is called a reduction: it sums up the elements of the vector and is not parallel
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in vector and SIMD architectures
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
  for (i=999; i>=0; i=i-1)
      finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
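Putting the pieces together, a runnable sketch of the whole scheme under the slide's assumptions (10,000 products, ten strips of 1000 each; all names are illustrative):

  #include <stdio.h>

  #define N 10000
  #define P 10           /* number of "processors" (strips) */
  #define STRIP (N / P)  /* 1000 elements per strip */

  int main(void) {
      static double x[N], y[N], sum[N], finalsum[P];
      for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

      /* Scalar expansion: fully parallel/vectorizable. */
      for (int i = N - 1; i >= 0; i--)
          sum[i] = x[i] * y[i];

      /* Per-strip reduction: the loop each "processor" p would run. */
      for (int p = 0; p < P; p++)
          for (int i = STRIP - 1; i >= 0; i--)
              finalsum[p] += sum[i + STRIP * p];

      /* Final reduction across the ten partial sums. */
      double total = 0.0;
      for (int p = 0; p < P; p++)
          total += finalsum[p];

      printf("%f\n", total);  /* prints 20000.000000 */
      return 0;
  }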
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, the result is simpler hardware that is more energy efficient and has a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
NVIDIA GPU Architecture
bull Similarities to vector machinesndash Works well with data‐level parallel problemsndash Scatter‐gather transfers from memory into local storendash Mask registersndash Large register files
bull Differencesndash No scalar processor scalar integrationndash Uses multithreading to hide memory latencyndash Has many functional units as opposed to a few deeply pipelined units like a vector processor
CA-Lec8 cwliutwinseenctuedutw 63
Graphical P
rocessing Units
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Programming the GPUbull CUDA Programming Model
ndash Single Instruction Multiple Thread (SIMT)bull A thread is associated with each data elementbull Threads are organized into blocksbull Blocks are organized into a grid
bull GPU hardware handles thread management not applications or OSndash Given the hardware invested to do graphics well how can we supplement it to improve performance of a wider range of applications
CA-Lec8 cwliutwinseenctuedutw 64
Graphical P
rocessing Units
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Example
bull Multiply two vectors of length 8192ndash Code that works over all elements is the gridndash Thread blocks break this down into manageable sizes
bull 512 threads per blockndash SIMD instruction executes 32 elements at a timendash Thus grid size = 16 blocksndash Block is analogous to a strip‐mined vector loop with vector length of 32
ndash Block is assigned to a multithreaded SIMD processor by the thread block scheduler
ndash Current‐generation GPUs (Fermi) have 7‐15 multithreaded SIMD processors
CA-Lec8 cwliutwinseenctuedutw 65
Graphical P
rocessing Units
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue: memory bandwidth
Terminology
bull Threads of SIMD instructionsndash Each has its own PCndash Thread scheduler uses scoreboard to dispatchndash No data dependencies between threadsndash Keeps track of up to 48 threads of SIMD instructions
bull Hides memory latencybull Thread block scheduler schedules blocks to SIMD processors
bull Within each SIMD processorndash 32 SIMD lanesndash Wide and shallow compared to vector processors
CA-Lec8 cwliutwinseenctuedutw 66
Graphical P
rocessing Units
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Example
bull NVIDIA GPU has 32768 registersndash Divided into lanesndash Each SIMD thread is limited to 64 registersndash SIMD thread has up to
bull 64 vector registers of 32 32‐bit elementsbull 32 vector registers of 32 64‐bit elements
ndash Fermi has 16 physical SIMD lanes each containing 2048 registers
CA-Lec8 cwliutwinseenctuedutw 67
Graphical P
rocessing Units
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
GTX570 GPUGlobal Memory
1280MB
Texture Cache8KB
Constant Cache8KB
Shared Memory48KB
L1 Cache16KB
L2 Cache640KB
SM 0
Registers32768
Shared Memory48KB
SM 14
Registers32768
32 cores
32cores
Up to 1536 ThreadsSM
CA-Lec8 cwliutwinseenctuedutw 68
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
GPU Threads in SM (GTX570)
bull 32 threads within a block work collectively Memory access optimization latency hiding
CA-Lec8 cwliutwinseenctuedutw 69
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Matrix Multiplication
CA-Lec8 cwliutwinseenctuedutw 70
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw 92 (Detecting and Enhancing Loop-Level Parallelism)
Renaming to Eliminate False (Pseudo) Dependences
• Before:
for (i=0; i<100; i=i+1) {
  Y[i] = X[i] / c;
  X[i] = X[i] + c;
  Z[i] = Y[i] + c;
  Y[i] = c - Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
  T[i] = X[i] / c;
  X1[i] = X[i] + c;
  Z[i] = T[i] + c;
  Y[i] = c - T[i];
}
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i-1)
  sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
for (i=9999; i>=0; i=i-1)
  sum[i] = x[i] * y[i];
for (i=9999; i>=0; i=i-1)
  finalsum = finalsum + sum[i];
CA-Lec8 cwliutwinseenctuedutw 94
The first loop is called scalar expansion (scalar ===> vector); it is parallel.
The second loop is called a reduction; it sums up the elements of the vector and is not parallel.
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
 – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors:
for (i=999; i>=0; i=i-1)
  finalsum[p] = finalsum[p] + sum[i+1000*p];
 – Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
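To make the ten-processor example concrete, here is a C sketch (ours; the array names are hypothetical) of the full two-step reduction: each processor privately sums its 1000-element slice, and a log-depth tree then combines the ten partial sums.

/* Step 1: each "processor" p sums its own slice (the slide's loop). */
double partial[10];
for (int p = 0; p < 10; p++) {            /* each p can run in parallel */
    partial[p] = 0.0;
    for (int i = 999; i >= 0; i--)
        partial[p] += sum[i + 1000 * p];  /* sum[] from the previous slide */
}
/* Step 2: combine the 10 partials in a tree: ceil(log2(10)) = 4 steps. */
for (int stride = 1; stride < 10; stride *= 2)
    for (int p = 0; p + stride < 10; p += 2 * stride)
        partial[p] += partial[p + stride];
double finalsum = partial[0];             /* the complete reduction */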
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
 – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading based on an OOO superscalar microarchitecture
 – Instead of replicating registers, reuse the rename registers
• Vector is an alternative model for exploiting ILP
 – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
 – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• The fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Programming the GPU
CA-Lec8 cwliutwinseenctuedutw 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication:
 – Matrix C will require calculation of 16,777,216 matrix cells
• On the GPU, each cell is calculated by its own thread
• We can have 23,040 active threads (GTX570), which means we can have this many matrix cells calculated in parallel
• On a general-purpose processor we can only calculate one cell at a time
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C
• Sub-matrices are read into shared memory from global memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
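A minimal CUDA sketch of the one-thread-per-element scheme described above (the kernel name, block shape, and plain global-memory loop are our assumptions; the slide's buffered version would additionally stage tile-sized sub-matrices in __shared__ memory):

#define N 4096

// Each thread computes one element C[row][col] of the 4096x4096 product.
__global__ void matmul(const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; k++)
            acc += A[row * N + k] * B[k * N + col];  // dot product of row and column
        C[row * N + col] = acc;
    }
}

// Launch with 16x16-thread blocks: 256x256 blocks cover all 16,777,216 cells;
// the hardware runs as many of them concurrently as it has active threads for.
// dim3 threads(16, 16);
// dim3 blocks(N / 16, N / 16);
// matmul<<<blocks, threads>>>(dA, dB, dC);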
Programming the GPU
• Distinguishing the execution place of functions:
 __device__ or __global__ => GPU (device); variables declared in them are allocated to the GPU memory
 __host__ => system processor (host)
• Function call: name<<<dimGrid, dimBlock>>>(parameter list)
 blockIdx: block identifier
 threadIdx: thread identifier within a block
 blockDim: threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program Example
// Invoke DAXPY
daxpy(n, 2.0, x, y);
// DAXPY in C
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

// Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
// DAXPY in CUDA (note: a kernel launched with <<<...>>> must be __global__ in current CUDA)
__device__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = a * x[i] + y[i];
}
CA-Lec8 cwliutwinseenctuedutw 74
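For completeness, a minimal host-side sketch (ours; the dx/dy names are hypothetical, and the kernel is assumed declared __global__ so it can be launched) of the allocation and copies the launch above relies on:

#include <cuda_runtime.h>

void run_daxpy(int n, double a, const double *x, double *y) {
    double *dx, *dy;
    size_t bytes = n * sizeof(double);
    cudaMalloc(&dx, bytes);                            // device buffers
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);  // copy inputs in
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);
    int nblocks = (n + 255) / 256;                     // as on the slide
    daxpy<<<nblocks, 256>>>(n, a, dx, dy);
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);  // copy result out
    cudaFree(dx);
    cudaFree(dy);
}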
NVIDIA Instruction Set Arch.
• "Parallel Thread Execution (PTX)"
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32 R8, blockIdx, 9        ; Thread Block ID * Block size (512 or 2^9)
add.s32 R8, R8, threadIdx      ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8]      ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]      ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4          ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2          ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0      ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses:
 – Branch synchronization stack
   • Entries consist of masks for each SIMD lane
   • I.e., which threads commit their results (all threads execute)
 – Instruction markers to manage when a branch diverges into multiple execution paths
   • Push on divergent branch
 – …and when paths converge
   • Act as barriers
   • Pop the stack
• Per-thread-lane 1-bit predicate register, specified by programmer
CA-Lec8 cwliutwinseenctuedutw 76 (Graphical Processing Units)
Example
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else
  X[i] = Z[i];

ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0         ; P1 is predicate register 1
@!P1, bra ELSE1, *Push           ; Push old mask, set new mask bits
                                 ; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2            ; Difference in RD0
st.global.f64 [X+R8], RD0        ; X[i] = RD0
@P1, bra ENDIF1, *Comp           ; complement mask bits
                                 ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0        ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw 77 (Graphical Processing Units)
NVIDIA GPU Memory Structures
• Each SIMD lane has a private section of off-chip DRAM
 – "Private memory"
 – Contains the stack frame, spilled registers, and private variables
• Each multithreaded SIMD processor also has local memory
 – Shared by the SIMD lanes / threads within a block
• Memory shared by all SIMD processors is GPU Memory
 – The host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw 78 (Graphical Processing Units)
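In CUDA source these three levels correspond to declaration qualifiers; a small sketch (identifiers ours, assuming a single 256-thread block) of how each space is named:

__device__ float gpu_buffer[256];      // GPU Memory: visible to all SIMD
                                       // processors; the host can copy in/out

__global__ void memory_levels(void) {
    float private_tmp;                 // per-thread "private memory"; spills
                                       // go to the lane's off-chip DRAM
    __shared__ float block_tile[256];  // local memory of one SIMD processor,
                                       // shared by this block's threads/lanes
    private_tmp = gpu_buffer[threadIdx.x];
    block_tile[threadIdx.x] = private_tmp;
    __syncthreads();                   // barrier across the block
    gpu_buffer[threadIdx.x] = block_tile[255 - threadIdx.x];
}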
Fermi Architecture Innovations
• Each SIMD processor has:
 – Two SIMD thread schedulers, two instruction dispatch units
 – 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
 – Thus two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision
• Caches for GPU memory
• 64-bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw 79 (Graphical Processing Units)
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw 80 (Graphical Processing Units)
Compiler Technology for Loop‐Level Parallelism
• Loop-carried dependence
 – Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
• Loop-level parallelism has no loop-carried dependence
• Example 1:
for (i=999; i>=0; i=i-1)
  x[i] = x[i] + s;
• No loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 81 (Detecting and Enhancing Loop-Level Parallelism)
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
  A[i+1] = A[i] + C[i];    /* S1 */
  B[i+1] = B[i] + A[i+1];  /* S2 */
}
• S1 and S2 use values computed by S1 in the previous iteration
 – Loop-carried dependence
• S2 uses a value computed by S1 in the same iteration
 – Not a loop-carried dependence
CA-Lec8 cwliutwinseenctuedutw 82 (Detecting and Enhancing Loop-Level Parallelism)
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Matrix Multiplicationbull For a 4096x4096matrix multiplication
‐Matrix C will require calculation of 16777216matrix cells bull On the GPU each cell is calculated by its own threadbull We can have 23040 active threads (GTX570) which means
we can have this many matrix cells calculated in parallelbull On a general purpose processor we can only calculate one cell
at a timebull Each thread exploits the GPUs fine granularity by computing
one element of Matrix Cbull Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU bandwidth
CA-Lec8 cwliutwinseenctuedutw 72
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Programming the GPU
bull Distinguishing execution place of functions _device_ or _global_ =gt GPU Device Variables declared are allocated to the GPU memory
_host_ =gt System processor (HOST)bull Function call NameltltdimGrid dimBlockgtgt(parameter list) blockIdx block identifier threadIdx threads per block identifier blockDim threads per block
CA-Lec8 cwliutwinseenctuedutw 73
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
CUDA Program ExampleInvoke DAXPYdaxpy(n20xy)
DAXPY in Cvoid daxpy(int n double a double x double y)
for (int i=0iltni++)y[i]= ax[i]+ y[i]
Invoke DAXPY with 256 threads per Thread Block_host_int nblocks = (n+255)256daxpyltltltnblocks 256gtgtgt (n20xy)
DAXPY in CUDA_device_void daxpy(int ndouble adouble xdouble y)
int i=blockIDxxblockDimx+threadIdxxif (iltn)
y[i]= ax[i]+ y[i]
CA-Lec8 cwliutwinseenctuedutw 74
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
NVIDIA Instruction Set Archbull ldquoParallel Thread Execution (PTX)rdquobull Uses virtual registersbull Translation to machine code is performed in softwarebull Example
shls32 R8 blockIdx 9 Thread Block ID Block size (512 or 29)adds32 R8 R8 threadIdx R8 = i = my CUDA thread IDldglobalf64 RD0 [X+R8] RD0 = X[i]ldglobalf64 RD2 [Y+R8] RD2 = Y[i]mulf64 R0D RD0 RD4 Product in RD0 = RD0 RD4 (scalar a)addf64 R0D RD0 RD2 Sum in RD0 = RD0 + RD2 (Y[i])stglobalf64 [Y+R8] RD0 Y[i] = sum (X[i]a + Y[i])
CA-Lec8 cwliutwinseenctuedutw 75
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Conditional Branching
bull Like vector architectures GPU branch hardware uses internal masksbull Also uses
ndash Branch synchronization stackbull Entries consist of masks for each SIMD lanebull Ie which threads commit their results (all threads execute)
ndash Instruction markers to manage when a branch diverges into multiple execution paths
bull Push on divergent branch
ndash hellipand when paths convergebull Act as barriersbull Pops stack
bull Per‐thread‐lane 1‐bit predicate register specified by programmer
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
76
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop-Level Parallelism
• Example 2:
    for (i=0; i<100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }
• S1 and S2 use values computed by S1 and S2, respectively, in the previous iteration
  – Loop-carried dependence
• S2 uses the value computed by S1 in the same iteration
  – No loop-carried dependence
Detecting and Enhancing Loop-Level Parallelism
CA-Lec8 cwliutwinseenctuedutw 82
Remarks
• Intra-loop dependence is not loop-carried dependence
  – A sequence of vector instructions that uses chaining exhibits exactly intra-loop dependence
• Two types of S1-S2 intra-loop dependence:
  – Circular: S1 depends on S2 and S2 depends on S1
  – Not circular: neither statement depends on itself, and although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in the dependences
  – The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop-Level Parallelism (1)
• Example 3:
    for (i=0; i<100; i=i+1) {
        A[i] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];  /* S2 */
    }
• S1 uses a value computed by S2 in the previous iteration, but this dependence is not circular
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop-Level Parallelism (2)
• Transform to:
    A[0] = A[0] + B[0];
    for (i=0; i<99; i=i+1) {
        B[i+1] = C[i] + D[i];      /* S2 */
        A[i+1] = A[i+1] + B[i+1];  /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop-carried
CA-Lec8 cwliutwinseenctuedutw 85
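As a sanity check, the sketch below (data values are my assumptions) runs the original and transformed loops on identical inputs and asserts that they produce the same A and B.

    #include <assert.h>

    int main(void) {
        double A1[101], B1[101], A2[101], B2[101], C[100], D[100];
        for (int i = 0; i < 101; i++) { A1[i] = A2[i] = i; B1[i] = B2[i] = 2 * i; }
        for (int i = 0; i < 100; i++) { C[i] = i + 1; D[i] = i + 2; }

        /* original loop */
        for (int i = 0; i < 100; i++) {
            A1[i] = A1[i] + B1[i];    /* S1 */
            B1[i + 1] = C[i] + D[i];  /* S2 */
        }

        /* transformed loop: the dependence is no longer loop-carried */
        A2[0] = A2[0] + B2[0];
        for (int i = 0; i < 99; i++) {
            B2[i + 1] = C[i] + D[i];            /* S2 */
            A2[i + 1] = A2[i + 1] + B2[i + 1];  /* S1 */
        }
        B2[100] = C[99] + D[99];

        for (int i = 0; i < 101; i++) {
            assert(A1[i] == A2[i]);
            assert(B1[i] == B2[i]);
        }
        return 0;
    }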
More on Loop-Level Parallelism
    for (i=0; i<100; i=i+1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
• The second reference to A need not be translated to a load instruction
  – The two references are the same; there is no intervening memory access to the same location
• A more complex analysis, i.e., loop-carried dependence analysis + data dependence analysis, can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
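A hand-written version of what the compiler does might look like the following sketch: the computed value stays in a register-resident temporary, so the second reference to A[i] costs no load (the function wrapper and the temporary's name are mine).

    /* Sketch: keep A[i] in a temporary so its second use needs no reload. */
    void fuse(double A[100], const double B[100], const double C[100],
              double D[100], const double E[100]) {
        for (int i = 0; i < 100; i = i + 1) {
            double t = B[i] + C[i];   /* compute once, hold in a register */
            A[i] = t;                 /* single store to A[i] */
            D[i] = t * E[i];          /* reuse the register value */
        }
    }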
Recurrence
• Recurrence is a special form of loop-carried dependence
• Recurrence example:
    for (i=1; i<100; i=i+1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important
  – Some vector computers have special support for executing recurrences
  – It may still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
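How much ILP a recurrence leaves depends on its dependence distance. The slide's loop has distance 1; a distance of 5 (a variant along the lines of the textbook's discussion; the second array name is my choice) leaves five independent chains that can proceed concurrently, as the sketch below illustrates.

    /* Sketch: a larger dependence distance exposes more ILP. */
    void recurrences(double Y[100], double Z[100]) {
        /* distance 1: iteration i must wait for iteration i-1 */
        for (int i = 1; i < 100; i = i + 1)
            Y[i] = Y[i - 1] + Y[i];

        /* distance 5: iterations i .. i+4 are mutually independent,
           so up to five additions can be in flight at once */
        for (int i = 5; i < 100; i = i + 1)
            Z[i] = Z[i - 5] + Z[i];
    }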
Compiler Technology for Finding Dependences
• Goals: to determine which loops might contain parallelism ("inexact") and to eliminate name dependences
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine
  – A one-dimensional array index is affine if it can be written in the form a × i + b (i is the loop index)
  – The index of a multi-dimensional array is affine if the index in each dimension is affine
  – Non-affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies: Example
• Assume:
  – Load an array element with index c × i + d, and store to index a × i + b
  – i runs from m to n
• A dependence exists if the following two conditions hold:
  1. There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n
  2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at compile time
  – Dependence testing is expensive but decidable
  – GCD (greatest common divisor) test:
    • If a loop-carried dependence exists, then GCD(c, a) must divide (d − b)
CA-Lec8 cwliutwinseenctuedutw 89
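A minimal C rendering of the GCD test (the helper names are mine; the logic follows the slide: store index a*i + b, load index c*i + d). Because the test ignores the loop bounds, a return value of 1 only means a dependence may exist.

    #include <stdio.h>
    #include <stdlib.h>

    /* greatest common divisor (Euclid) */
    static int gcd(int a, int b) {
        while (b != 0) { int t = a % b; a = b; b = t; }
        return abs(a);
    }

    /* GCD test: returns 0 only when a dependence is provably impossible;
       returns 1 when a dependence MAY exist (the test is inexact). */
    static int gcd_test_may_depend(int a, int b, int c, int d) {
        return (d - b) % gcd(c, a) == 0;
    }

    int main(void) {
        /* next slide's example: X[2*i+3] = X[2*i] * 5.0
           a=2, b=3, c=2, d=0; GCD = 2 does not divide -3 */
        printf("dependence possible? %d\n", gcd_test_may_depend(2, 3, 2, 0)); /* 0 */
        return 0;
    }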
Example
    for (i=0; i<100; i=i+1) {
        X[2*i+3] = X[2*i] * 5.0;
    }
• Solution:
  1. a=2, b=3, c=2, and d=0
  2. GCD(a, c) = 2; |b − d| = 3
  3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
• The GCD test is sufficient but not necessary
  – A failed test guarantees that no dependence exists, but a successful test does not prove one
  – The GCD test does not consider the loop bounds
  – There are cases where the GCD test succeeds but no dependence exists
• Determining whether a dependence actually exists is NP-complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependencies
• Example 2:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }
• True dependences: S1 → S3 and S1 → S4 (Y[i]), but not loop-carried
• Antidependences: S1 → S2 (X[i]), S3 → S4 (Y[i])
• Output dependence: S1 → S4 (Y[i])
Detecting and Enhancing Loop-Level Parallelism
CA-Lec8 cwliutwinseenctuedutw 92
Renaming to Eliminate False (Pseudo) Dependences
• Before:
    for (i=0; i<100; i=i+1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i=0; i<100; i=i+1) {
        T[i] = X[i] / c;
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
    for (i=9999; i>=0; i=i-1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop-carried dependence
• Transform to…
    for (i=9999; i>=0; i=i-1)
        sum[i] = x[i] * y[i];         /* scalar expansion: scalar ===> vector; parallel */
    for (i=9999; i>=0; i=i-1)
        finalsum = finalsum + sum[i]; /* a reduction: sums up the elements of the vector; not parallel */
CA-Lec8 cwliutwinseenctuedutw 94
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector and SIMD architecture
  – Similar to what can be done in a multiprocessor environment
• Example: to sum up 1000 elements on each of ten processors
    for (i=999; i>=0; i=i-1)
        finalsum[p] = finalsum[p] + sum[i+1000*p];
  – Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
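Putting scalar expansion, the per-processor partial sums, and the final reduction together gives the following end-to-end C sketch (the p loop is written sequentially here purely for illustration; on real hardware each p would run on its own processor, and the data values are assumptions).

    #include <stdio.h>

    #define N 10000
    #define P 10   /* ten processors, 1000 elements each */

    int main(void) {
        static double x[N], y[N], sum[N];
        double finalsum[P] = {0}, total = 0;
        for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

        /* scalar expansion: fully parallel elementwise products */
        for (int i = N - 1; i >= 0; i--)
            sum[i] = x[i] * y[i];

        /* per-processor partial reductions: each p is independent */
        for (int p = 0; p < P; p++)
            for (int i = 999; i >= 0; i--)
                finalsum[p] = finalsum[p] + sum[i + 1000 * p];

        /* final reduction over the ten partial sums */
        for (int p = 0; p < P; p++)
            total = total + finalsum[p];

        printf("%g\n", total);   /* prints 20000 for these inputs */
        return 0;
    }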
Multithreading and Vector Summary
• Explicit parallelism (DLP or TLP) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on a big stall vs. switch every clock cycle
• Simultaneous multithreading: fine-grained multithreading based on an OOO superscalar microarchitecture
  – Instead of replicating registers, reuse rename registers
• Vector is an alternative model for exploiting ILP
  – If code is vectorizable, then simpler hardware, more energy efficient, and a better real-time model than OOO machines
  – Design issues include number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on
• Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Exampleif (X[i] = 0)
X[i] = X[i] ndash Y[i]else X[i] = Z[i]
ldglobalf64 RD0 [X+R8] RD0 = X[i]setpneqs32 P1 RD0 0 P1 is predicate register 1P1 bra ELSE1 Push Push old mask set new mask bits
if P1 false go to ELSE1ldglobalf64 RD2 [Y+R8] RD2 = Y[i]subf64 RD0 RD0 RD2 Difference in RD0stglobalf64 [X+R8] RD0 X[i] = RD0P1 bra ENDIF1 Comp complement mask bits
if P1 true go to ENDIF1ELSE1 ldglobalf64 RD0 [Z+R8] RD0 = Z[i]
stglobalf64 [X+R8] RD0 X[i] = RD0ENDIF1 ltnext instructiongt Pop pop to restore old mask
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
77
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
NVIDIA GPU Memory Structures
bull Each SIMD Lane has private section of off‐chip DRAMndash ldquoPrivate memoryrdquondash Contains stack frame spilling registers and private variables
bull Each multithreaded SIMD processor also has local memoryndash Shared by SIMD lanes threads within a block
bull Memory shared by SIMD processors is GPU Memoryndash Host can read and write GPU memory
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
78
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Fermi Architecture Innovationsbull Each SIMD processor has
ndash Two SIMD thread schedulers two instruction dispatch unitsndash 16 SIMD lanes (SIMD width=32 chime=2 cycles) 16 load‐store units 4
special function unitsndash Thus two threads of SIMD instructions are scheduled every two clock
cyclesbull Fast double precisionbull Caches for GPU memorybull 64‐bit addressing and unified address spacebull Error correcting codesbull Faster context switchingbull Faster atomic instructions
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
79
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Fermi Multithreaded SIMD Proc
CA-Lec8 cwliutwinseenctuedutw
Graphical P
rocessing Units
80
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Compiler Technology for Loop‐Level Parallelism
bull Loop‐carried dependencendash Focuses on determining whether data accesses in later iterations are dependent on data values produced in earlier iterations
bull Loop‐level parallelism has no loop‐carried dependence
bull Example 1for (i=999 igt=0 i=i‐1)
x[i] = x[i] + s
bull No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
81
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Example 2 for Loop‐Level Parallelism
bull Example 2for (i=0 ilt100 i=i+1)
A[i+1] = A[i] + C[i] S1 B[i+1] = B[i] + A[i+1] S2
bull S1 and S2 use values computed by S1 in previous iterationndash Loop‐carried dependence
bull S2 uses value computed by S1 in same iterationndash No loop‐carried dependence
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
82
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Remarks
bull Intra‐loop dependence is not loop‐carried dependencendash A sequence of vector instructions that uses chainingexhibits exactly intra‐loop dependence
bull Two types of S1‐S2 intra‐loop dependencendash Circular S1 depends on S2 and S2 depends on S1ndash Not circular neither statement depends on itself and although S1 depends on S2 S2 does not depend on S1
bull A loop is parallel if it can be written without a cycle in the dependencesndash The absence of a cycle means that the dependences give a partial ordering on the statements
CA-Lec8 cwliutwinseenctuedutw 83
Example 3 for Loop‐Level Parallelism (1)
bull Example 3for (i=0 ilt100 i=i+1)
A[i] = A[i] + B[i] S1 B[i+1] = C[i] + D[i] S2
bull S1 uses value computed by S2 in previous iteration but this dependence is not circular
bull There is no dependence from S1 to S2 interchanging the two statements will not affect the execution of S2
CA-Lec8 cwliutwinseenctuedutw 84
Example 3 for Loop‐Level Parallelism (2)
bull Transform to A[0] = A[0] + B[0]for (i=0 ilt99 i=i+1)
B[i+1] = C[i] + D[i] S2A[i+1] = A[i+1] + B[i+1] S1
B[100] = C[99] + D[99]
bull The dependence between the two statements is no longer loop carried
CA-Lec8 cwliutwinseenctuedutw 85
More on Loop‐Level Parallelism
for (i=0ilt100i=i+1) A[i] = B[i] + C[i]D[i] = A[i] E[i]
bull The second reference to A needs not be translated to a load
instructionndash The two reference are the same There is no intervening memory
access to the same location
bull A more complex analysis ie Loop‐carried dependence analysis + data dependence analysis can be applied in the same basic block to optimize
CA-Lec8 cwliutwinseenctuedutw 86
Recurrence
bull Recurrence is a special form of loop‐carried dependence bull Recurrence Example
for (i=1ilt100i=i+1) Y[i] = Y[i‐1] + Y[i]
bull Detecting a recurrence is importantndash Some vector computers have special support for executing recurrence
ndash It my still be possible to exploit a fair amount of ILP
CA-Lec8 cwliutwinseenctuedutw 87
Compiler Technology for Finding Dependences
bull To determine which loop might contain parallelism (ldquoinexactrdquo) and to eliminate name dependences
bull Nearly all dependence analysis algorithms work on the assumption that array indices are affine ndash A one‐dimensional array index is affine if it can written in the form a i + b (i
is loop index)ndash The index of a multi‐dimensional array is affine if the index in each dimension
is affinendash Non‐affine access example x[ y[i] ]
bull Determining whether there is a dependence between two references to the same array in a loop is equivalent to determining whether two affine functions can have the same value for different indices between the bounds of the loop
CA-Lec8 cwliutwinseenctuedutw 88
Finding dependencies Examplebull Assume
ndash Load an array element with index c i + d and store to a i+ b
ndash i runs from m to nbull Dependence exists if the following two conditions hold
1 Given j k such that m le j le n m le k le n2 a j + b = c k + d
bull In general the values of a b c and d are not known at compile timendash Dependence testing is expensive but decidablendash GCD (greatest common divisor) test
bull If a loop‐carried dependence exists then GCD(ca) | |db|
CA-Lec8 cwliutwinseenctuedutw 89
Example
for (i=0 ilt100 i=i+1) X[2i+3] = X[2i] 50
bull Solution1 a=2 b=3 c=2 and d=02 GCD(a c)=2 |b‐d|=33 Since 2 does not divide 3 no dependence is possible
CA-Lec8 cwliutwinseenctuedutw 90
Remarks
bull The GCD test is sufficient but not necessaryndash GCD test does not consider the loop boundsndash There are cases where the GCD test succeeds but no dependence exists
bull Determining whether a dependence actually exists is NP‐complete
CA-Lec8 cwliutwinseenctuedutw 91
Finding dependenciesbull Example 2
for (i=0 ilt100 i=i+1) Y[i] = X[i] c S1 X[i] = X[i] + c S2 Z[i] = Y[i] + c S3 Y[i] = c ‐ Y[i] S4
bull True dependence S1‐gtS3 (Y[i]) S1‐gtS4 (Y[i]) but not loop‐carried
bull Antidependence S1‐gtS2 (X[i]) S3‐gtS4 (Y[i]) (Y[i])bull Output dependence S1‐gtS4 (Y[i])
CA-Lec8 cwliutwinseenctuedutw
Detecting and E
nhancing Loop-Level Parallelism
92
Renaming to Eliminate False (Pseudo) Dependences
bull Beforefor (i=0 ilt100 i=i+1) Y[i] = X[i] c X[i] = X[i] + c Z[i] = Y[i] + c Y[i] = c ‐ Y[i]
bull Afterfor (i=0 ilt100 i=i+1) T[i] = X[i] c X1[i] = X[i] + cZ[i] = T[i] + cY[i] = c ‐ T[i]
CA-Lec8 cwliutwinseenctuedutw 93
Eliminating Recurrence Dependence
bull Recurrence example a dot productfor (i=9999 igt=0 i=i‐1)
sum = sum + x[i] y[i]
bull The loop is not parallel because it has a loop‐carried dependencebull Transform tohellip
for (i=9999 igt=0 i=i‐1)sum [i] = x[i] y[i]
for (i=9999 igt=0 i=i‐1)finalsum = finalsum + sum[i]
CA-Lec8 cwliutwinseenctuedutw 94
This is called scalar expansionScalar ===gt VectorParallel
This is called a reductionSums up the elements of the vectorNot parallel
Reductionbull Reductions are common in linear algebra algorithmbull Reductions can be handled by special hardware in a vector
and SIMD architecturendash Similar to what can be done in multiprocessor environment
bull Example To sum up 1000 elements on each of ten processorsfor (i=999 igt=0 i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000p]
ndash Assume p ranges from 0 to 9
CA-Lec8 cwliutwinseenctuedutw 95
Multithreading and Vector Summary
bull Explicitly parallel (DLP or TLP) is next step to performancebull Coarse‐grained vs Fine‐grained multithreading
ndash Switch only on big stall vs switch every clock cyclebull Simultaneous multithreading if fine grained multithreading
based on OOO superscalar microarchitecturendash Instead of replicating registers reuse rename registers
bull Vector is alternative model for exploiting ILPndash If code is vectorizable then simpler hardware more energy
efficient and better real‐time model than OOO machinesndash Design issues include number of lanes number of FUs number
of vector registers length of vector registers exception handling conditional operations and so on
bull Fundamental design issue is memory bandwidth
CA-Lec8 cwliutwinseenctuedutw 96
Example 3 for Loop‐Level Parallelism (1)

• Example 3:
    for (i = 0; i < 100; i = i + 1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
• S1 uses the value computed by S2 in the previous iteration, but this dependence is not circular.
• There is no dependence from S1 to S2: interchanging the two statements will not affect the execution of S2.

Example 3 for Loop‐Level Parallelism (2)

• Transform to:
    A[0] = A[0] + B[0];
    for (i = 0; i < 99; i = i + 1) {
        B[i+1] = C[i] + D[i];        /* S2 */
        A[i+1] = A[i+1] + B[i+1];    /* S1 */
    }
    B[100] = C[99] + D[99];
• The dependence between the two statements is no longer loop‐carried.

More on Loop‐Level Parallelism

    for (i = 0; i < 100; i = i + 1) {
        A[i] = B[i] + C[i];
        D[i] = A[i] * E[i];
    }
• The second reference to A need not be translated to a load instruction.
    – The two references are to the same location, and there is no intervening memory access to that location.
• A more complex analysis (loop‐carried dependence analysis combined with data dependence analysis within the same basic block) can be applied to optimize such code; a sketch follows below.

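To make the reuse concrete, here is a minimal C sketch of what the compiler effectively does once the analysis proves the second reference is the value just stored: it keeps the value in a scalar (i.e., a register) instead of reloading it. The function name and 100‐element bound are illustrative, not from the slide.

    #define N 100

    /* Hypothetical illustration: after dependence analysis proves the
     * second reference to A[i] is the value just stored, the value can
     * be kept in a register instead of being reloaded from memory.   */
    void fuse_reuse(double *A, double *D, const double *B,
                    const double *C, const double *E) {
        for (int i = 0; i < N; i++) {
            double a = B[i] + C[i];   /* computed once, held in a register */
            A[i] = a;                 /* the store still happens           */
            D[i] = a * E[i];          /* reuse: no reload of A[i] needed   */
        }
    }
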
Recurrence

• A recurrence is a special form of loop‐carried dependence.
• Recurrence example:
    for (i = 1; i < 100; i = i + 1)
        Y[i] = Y[i-1] + Y[i];
• Detecting a recurrence is important:
    – Some vector computers have special support for executing recurrences.
    – It may still be possible to exploit a fair amount of ILP (see the unrolled sketch below).

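As a hedged illustration of that last point, the following sketch unrolls the recurrence two‐way: the chain of additions still serializes, but the loads, stores, and loop overhead of adjacent iterations can overlap. The function name and the remainder handling are assumptions for illustration.

    /* 2-way unrolled version of:
     *     for (i = 1; i < n; i++) Y[i] = Y[i-1] + Y[i];
     * The carried add chain (through t) still serializes, but the
     * memory operations of the two iterations in the body overlap. */
    void prefix_sum_unrolled(double *Y, int n) {
        int i = 1;
        for (; i + 1 < n; i += 2) {
            double t = Y[i - 1] + Y[i];   /* depends on the previous pair */
            Y[i]     = t;
            Y[i + 1] = t + Y[i + 1];
        }
        for (; i < n; i++)                /* remainder iteration, if any */
            Y[i] = Y[i - 1] + Y[i];
    }
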
Compiler Technology for Finding Dependences

• Goals: to determine which loops might contain parallelism (an "inexact" analysis) and to eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that array indices are affine:
    – A one‐dimensional array index is affine if it can be written in the form a × i + b, where i is the loop index and a, b are constants.
    – The index of a multi‐dimensional array is affine if the index in each dimension is affine.
    – Non‐affine access example: x[ y[i] ].
• Determining whether there is a dependence between two references to the same array in a loop is therefore equivalent to determining whether two affine functions can take the same value for different indices between the bounds of the loop. (A contrast of the two access forms appears below.)

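A small C contrast of the two cases (function names are illustrative): the first index can be analyzed at compile time, the second cannot, because the index itself is loaded from memory.

    /* Affine access: index has the form a*i + b with constant a, b. */
    void affine(double *X, int n) {
        for (int i = 0; i < n; i++)
            X[2*i + 3] = X[2*i] * 5.0;    /* analyzable: a=2, b=3 / c=2, d=0 */
    }

    /* Non-affine access: the index comes from memory, so no affine
     * dependence test applies.                                       */
    void non_affine(double *X, const int *Y, double c, int n) {
        for (int i = 0; i < n; i++)
            X[Y[i]] = X[Y[i]] + c;        /* x[ y[i] ]: gather/scatter */
    }
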
Finding Dependencies: Example

• Assume:
    – We load an array element with index c × i + d and store to index a × i + b.
    – i runs from m to n.
• A dependence exists if the following two conditions hold:
    1. There are two iteration indices j and k such that m ≤ j ≤ n and m ≤ k ≤ n.
    2. a × j + b = c × k + d.
• In general, the values of a, b, c, and d are not known at compile time.
    – Dependence testing is expensive but decidable.
    – GCD (greatest common divisor) test: if a loop‐carried dependence exists, then GCD(c, a) must divide (d − b).

Example

    for (i = 0; i < 100; i = i + 1)
        X[2*i + 3] = X[2*i] * 5.0;
• Solution:
    1. a = 2, b = 3, c = 2, and d = 0.
    2. GCD(a, c) = 2 and |d − b| = 3.
    3. Since 2 does not divide 3, no dependence is possible (a C sketch of the test follows below).

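The test is mechanical enough to express directly in C. Below is a minimal sketch (function names are illustrative, not from the lecture): it returns 0 when the test proves independence and 1 when a dependence remains possible.

    #include <stdio.h>

    /* Euclid's algorithm. */
    static int gcd(int x, int y) {
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x < 0 ? -x : x;
    }

    /* GCD dependence test for a store to X[a*i + b] and a load of
     * X[c*i + d].  Returns 1 if a dependence is still possible, and
     * 0 if gcd(c, a) does not divide (d - b), proving independence. */
    int gcd_test(int a, int b, int c, int d) {
        int g = gcd(c, a);
        int diff = d - b;
        if (diff < 0) diff = -diff;
        if (g == 0) return diff == 0;   /* degenerate a = c = 0 case */
        return diff % g == 0;
    }

    int main(void) {
        /* X[2*i + 3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
        printf("%d\n", gcd_test(2, 3, 2, 0));  /* prints 0: independent */
        return 0;
    }
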
Remarks

• The GCD test is sufficient but not necessary:
    – The GCD test does not consider the loop bounds.
    – There are cases where the GCD test succeeds but no dependence exists. For example, with a store to X[i] and a load of X[i+10] over 0 ≤ i < 10, GCD(1, 1) = 1 divides 10, yet the bounds keep the two access ranges disjoint.
• Determining whether a dependence actually exists is NP‐complete in general.

Finding Dependencies: Example 2

    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;   /* S1 */
        X[i] = X[i] + c;   /* S2 */
        Z[i] = Y[i] + c;   /* S3 */
        Y[i] = c - Y[i];   /* S4 */
    }
• True dependences: S1 → S3 and S1 → S4 (through Y[i]); neither is loop‐carried.
• Antidependences: S1 → S2 (X[i]) and S3 → S4 (Y[i]).
• Output dependence: S1 → S4 (Y[i]).

Renaming to Eliminate False (Pseudo) Dependences

• Before:
    for (i = 0; i < 100; i = i + 1) {
        Y[i] = X[i] / c;
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }
• After:
    for (i = 0; i < 100; i = i + 1) {
        T[i]  = X[i] / c;    /* Y renamed to T: removes the output and antidependences */
        X1[i] = X[i] + c;    /* X renamed to X1 */
        Z[i]  = T[i] + c;
        Y[i]  = c - T[i];
    }

Eliminating Recurrence Dependence

• Recurrence example, a dot product:
    for (i = 9999; i >= 0; i = i - 1)
        sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop‐carried dependence on sum.
• Transform to:
    for (i = 9999; i >= 0; i = i - 1)
        sum[i] = x[i] * y[i];            /* scalar expansion: scalar ==> vector; parallel     */
    for (i = 9999; i >= 0; i = i - 1)
        finalsum = finalsum + sum[i];    /* reduction: sums the vector's elements; not parallel */

Reduction

• Reductions are common in linear algebra algorithms.
• Reductions can be handled by special hardware in vector and SIMD architectures.
    – Similar to what can be done in a multiprocessor environment.
• Example: to sum up 1,000 elements on each of ten processors:
    for (i = 999; i >= 0; i = i - 1)
        finalsum[p] = finalsum[p] + sum[i + 1000*p];
    – Assume p ranges from 0 to 9, so each processor sums its own strip; a full two‐phase sketch follows below.

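A minimal serial C sketch of the whole two‐phase scheme (the constants and the final combining loop are assumptions for illustration; on a real machine the ten strip sums would run on ten processors in parallel):

    #define P 10          /* hypothetical processor count       */
    #define STRIP 1000    /* elements summed by each processor  */

    /* Two-phase reduction over sum[0 .. P*STRIP-1].              */
    double reduce(const double *sum) {
        double finalsum[P];
        /* Phase 1: each "processor" p sums its own strip; the P
         * outer iterations are mutually independent, so they can
         * run fully in parallel.                                 */
        for (int p = 0; p < P; p++) {
            finalsum[p] = 0.0;
            for (int i = STRIP - 1; i >= 0; i--)
                finalsum[p] += sum[i + STRIP * p];
        }
        /* Phase 2: combine the P partial sums; real hardware would
         * use a log-depth tree, shown serially here for clarity.  */
        double total = 0.0;
        for (int p = 0; p < P; p++)
            total += finalsum[p];
        return total;
    }
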
Multithreading and Vector Summary

• Explicit parallelism (DLP or TLP) is the next step to performance.
• Coarse‐grained vs. fine‐grained multithreading:
    – Switch only on a big stall vs. switch every clock cycle.
• Simultaneous multithreading: fine‐grained multithreading built on an OOO superscalar microarchitecture.
    – Instead of replicating registers, reuse the rename registers.
• Vector is an alternative model for exploiting ILP.
    – If code is vectorizable, it offers simpler hardware, better energy efficiency, and a better real‐time model than OOO machines.
    – Design issues include the number of lanes, number of FUs, number of vector registers, length of vector registers, exception handling, conditional operations, and so on.
• The fundamental design issue is memory bandwidth.