
Page 1:

Introduction to High Performance Programming

名古屋大学情報基盤中心 教授 片桐孝洋

Takahiro Katagiri, Professor, Information Technology Center, Nagoya University

Introduction to Parallel Programming for Multicore/Manycore Clusters

國家理論中心數學組「高效能計算」短期課程 (National Center for Theoretical Sciences, Mathematics Division: Short Course on High-Performance Computing)

Page 2:

Agenda
1. Hierarchical Cache Memories
2. Instruction Pipeline
3. Loop Unrolling
4. Contiguous Access to Arrays
5. Blocking
6. Other Techniques to Establish Speedups

Page 3:

Hierarchical Cache Memories

The fastest memories are also the smallest.

Page 4:

Memory Hierarchies on Recent Computers

From high speed (small) to large amount (slow):

  Registers     O(1 nanosecond)       Bytes
  Caches        O(10 nanoseconds)     KBytes - MBytes
  Main Memory   O(100 nanoseconds)    MBytes - GBytes
  Hard Disk     O(10 milliseconds)    GBytes - TBytes

The cost of moving data from main memory to registers is O(100) times the access cost of a register.

Page 5:

More Intuitive Explanations…

Figure: a register sits within a cache, which sits within main memory; each level is a small window onto the next.

For high-performance programming, we should modify programs so that they work within a very small part of their access area at a time. (This is called "spatial locality.")
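To make locality concrete, here is a minimal, self-contained C sketch (not from the slides; the array size and the use of rand() are illustrative). Both loops touch the same N elements exactly once; only the access order differs, and on cached machines the sequential version runs far faster because consecutive elements share cache lines:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

/* Sequential pass: consecutive addresses share cache lines
   (good spatial locality), so most accesses hit in the cache. */
double sum_sequential(const double *x) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += x[i];
    return s;
}

/* Scattered pass: the same elements in shuffled order; almost
   every access touches a different cache line. */
double sum_scattered(const double *x, const size_t *idx) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += x[idx[i]];
    return s;
}

int main(void) {
    double *x   = malloc(N * sizeof *x);
    size_t *idx = malloc(N * sizeof *idx);
    for (size_t i = 0; i < N; i++) { x[i] = 1.0; idx[i] = i; }
    for (size_t i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);  /* biased, fine for a demo */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    printf("%f %f\n", sum_sequential(x), sum_scattered(x, idx));
    free(x); free(idx);
    return 0;
}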

Page 6:

An Example of Cache Organization

Figure: the CPU contains registers and the arithmetic unit; the arithmetic unit issues operation requests and returns computation results, while data is supplied from and stored to the cache memory, which in turn exchanges data with main memory. Main memory is divided into banks (blocks); the cache holds a subset of those blocks. Cache directories map the upper part of a physical address to a cache set, while the lower part addresses data inside a block.

Page 7:

A Memory Organization of the Fujitsu FX10 (a K computer-type supercomputer)

Figure: the FX10 memory hierarchy, from high speed to large amount:

  Registers
  Level 1 Cache (32 KBytes / core)
  Level 2 Cache (12 MBytes / 16 cores)
  Main Memory (32 GBytes / node)

Data can reside at each of these levels.

Page 8:

A Memory Organization of the Fujitsu FX10 (a K computer-type supercomputer), continued

High-speed access is achieved if the data is on the L1 cache. (The figure repeats the hierarchy of the previous page, with the data now resident in the L1 cache.)

Page 9:

An Example of Memory Organization for Each Node on the FX10

This is a hierarchical memory organization.

Figure: one node. Sixteen cores (Core 0 - Core 15), each with its own L1 cache, share a single L2 cache, which connects to main memory.

Page 10:

An Example of Memory Organization for the Whole System on the FX10

Memory is hierarchical.

Figure: many such nodes (each with 16 cores, per-core L1 caches, a shared L2 cache, and its own main memory) are connected by the TOFU network (5 GBytes/s × both ways).

Page 11:

Node Organization of the FX10

Figure: there is only one socket per node. Inside the CPU, sixteen cores (Core #0 - Core #15), each with a 32 KB L1 data cache, share a 12 MB L2 cache. The L2 connects to DDR3 DIMM memory at 85 GB/s (8 Bytes × 1333 MHz × 8 channels); the total memory amount is 8 GB × 4 = 32 GB (4 GB × 2 channels, times four). The node connects to the TOFU network through the ICC at 20 GB/s.

Page 12:

Specification of the CPU for the FX10 (SPARC64 IXfx)

  Architecture Name           HPC-ACE (Extended SPARC-V9 Instruction Set)
  Frequency                   1.848 GHz
  L1 Cache                    32 KBytes (instruction and data are separate)
  L2 Cache                    12 MBytes
  Software-Controlled Cache   Sector cache
  Instruction Issue           2 integer operation units; 4 floating-point multiply-and-add (FMA) units
  SIMD Instruction            2 FMAs work per instruction; each executes 2 floating-point operations (add and multiply)
  Registers                   256 registers for floating-point computation
  Others                      Instructions for sin and cos, for conditional execution, and for division and sqrt

Page 13:

Node Organization of the FX100

Figure: two sockets (CMGs: Core Memory Groups) per node in a NUMA (Non-Uniform Memory Access) configuration. Each CMG holds 16 compute cores (e.g. Core #0 - Core #15) plus one assistant core, each with a 64 KB L1 data cache; all 17 cores share a 12 MB L2 cache, backed by a 16 GB HMC memory per CMG. Total memory amount per node: 32 GB. Memory bandwidth: reading 240 GB/s + writing 240 GB/s = 480 GB/s in total. The node connects to the TOFU2 network through the ICC.

Page 14:

Architectural Comparison between the FX10 and the FX100

source: https://www.ssken.gr.jp/MAINSITE/event/2015/20151028-sci/lecture-04/SSKEN_sci2015_miyoshi_presentation.pdf

  Item                          FX10                             FX100
  Computation Capacity / Node   236 GFLOPS (double and single)   Double: 1.011 TFLOPS; Single: 2.022 TFLOPS
  Number of Cores               16                               32
  Assistant Cores               None                             2
  SIMD Length                   128 bits                         256 bits
  SIMD Operations               Floating-point operations;       Those of the FX10, plus integer operations
                                contiguous load/store            and strided and indirect load/store
  L1D Cache / core              32 KB, 2-way                     64 KB, 4-way
  L2 Cache / node               12 MB                            24 MB
  Memory Bandwidth              85 GB/s                          480 GB/s (HMC)

Page 15:

Specification of the FX100 CPU (SPARC64 XIfx, Nagoya U.)

  Architecture Name           HPC-ACE2 (Extended SPARC-V9 Instruction Set)
  Frequency                   2.2 GHz
  L1 Cache                    64 KBytes (instruction and data are separate)
  L2 Cache                    24 MBytes
  Software-Controlled Cache   Sector cache
  Instruction Issue           2 integer operation units; 8 floating-point multiply-and-add (FMA) units
  SIMD Instruction            2 FMAs work per instruction; each executes 4 floating-point operations (add and multiply)
  Registers                   256 registers for floating-point computation

Page 16:

Instruction Pipeline

An assembly line for computations.

Page 17:

Assembly-Line Work

For example, assembling cars, where one worker owns one step (5 workers): making the car body; putting on the front and rear windows; interior finishing; exterior finishing; confirming functions.

Suppose the whole procedure takes 2 months, i.e., each step takes 0.4 month. After 2 months, the first car is made; after 4 months, the second car is made. Efficiency: 2 months per car.

Figure: a timeline of the first, second, and third cars, each passing through the five steps one after another, with no overlap.

Each worker works for 0.4 month and then rests for 1.6 months: very bad efficiency.

Page 18:

Assembly-Line Work

Instead, set up 5 work stations. As soon as a car finishes one step, it moves on to the next step: a belt-conveyor system.

Figure: the five steps (making the car body; putting on the front and rear windows; interior finishing; exterior finishing; confirming functions), each taking 0.4 month.

Page 19:

Assembly-Line Work

In this method, we can make:
  the first car in 2 months,
  the second car in 2.4 months,
  the third car in 2.8 months,
  the fourth car in 3.2 months,
  the fifth car in 3.6 months,
  the sixth car in 4.0 months.
Efficiency: about 0.67 month per car so far, approaching 0.4 month per car as production continues.

Figure: a timeline of the first through fifth cars, their five steps overlapping.

Each worker keeps working every 0.4 month once enough time has passed: very high efficiency.

This scheme is called a <Pipeline Process>.

Page 20:

Pipeline Processing on Computers

1. Hardware pipelining: performed by the computer hardware. Typical uses:
   1. computation with pipelining;
   2. sending data (operation codes, data) from memory with pipelining.
2. Software pipelining: performed by how the program is written. Typical uses:
   1. pipelining by the compiler (instruction preload, data prefetch, data post-store);
   2. pipelining by hand modification of the program (data preload, loop unrolling).

Page 21:

In the Case of an Arithmetic Unit (AU)

Example process for an AU. (Note: the process in a real AU differs from the following.)

Example, matrix-vector multiplication:

for (j=0; j<n; j++)
  for (i=0; i<n; i++) {
    y[j] += A[j][i] * x[i];
  }

If we do not use pipelining, the process looks as follows: for each operation in turn, the AU fetches data A from memory, fetches data B from memory, issues the computation, and stores the result before starting the next operation.

Page 22:

In the Case of an Arithmetic Unit (AU)

Without pipelining, the AU is busy for only one time unit out of every four, hence poor efficiency: computation efficiency is 1/4 = 25%. If we pipeline as follows, a computation completes in every time unit once enough time has passed: computation efficiency approaches 100%.

Figure: the four stages (fetch data A from memory, fetch data B from memory, issue computation, store result) of successive operations overlap in time.

"Enough time" means a long enough loop. A large loop length N keeps the pipeline free of stalls and hence highly efficient; if N is small, efficiency degrades.

Page 23:

Summary of Computation Pipelining

A concept for achieving full usage of the AU, i.e., high-performance computing. Fetching data from main memory is very slow; by scheduling load instructions in a pipeline, the fetch time can be hidden, so that the AU does useful work in every time unit.

In real execution this is not easy, because of the following issues:
1. Delays caused by the computer architecture, such as the limited number of registers and the limited data-supply bandwidth from memory to CPU and from CPU to memory. (The CPU of the FX100 is based on SPARC64.)
2. Loop overhead, such as initializing and incrementing loop induction variables (i, j) and evaluating the loop-exit condition.
3. Computing the memory addresses used to access array data.
4. Whether the compiler generates pipelined code or not.

Page 24:

Notes on Actual CPUs

Actual CPUs have independent pipelines for:
1. addition and subtraction, and
2. multiplication.
Moreover, there are several instructions for simultaneous pipelined operations.

In the Intel Pentium 4, the pipeline depth is 31! The time needed to fill such a pipeline is very large, and if there are many branches, instruction-prediction misses occur and efficiency degrades.

Recent multicore and manycore CPUs need only a small pipeline depth because of their low frequency; the Xeon Phi has a pipeline depth of 7.

Page 25:

Hardware Information for the FX10

8 floating-point operations per clock can be performed per core: each FMA instruction performs 2 multiply-and-add pairs (4 floating-point operations), and 2 FMAs work per clock, so 4 × 2 = 8 floating-point operations per clock.

At 1.848 GHz per core, the theoretical peak performance is:
  1.848 GHz × 8 = 14.784 GFLOPS / core
  1 node (16 cores): 14.784 × 16 = 236.5 GFLOPS / node

Registers for floating-point operations: 256 registers / core.

Page 26:

Loop Unrolling

Compilers do not always perform this automatically.

Page 27:

What is "Loop Unrolling"?

It is a tuning technique, also performed by compilers, that rewrites code to enhance:
1. register allocation for data;
2. pipelining.
It changes the loop stride from 1 to m: <depth-m unrolling>.

Page 28:

Loop Unrolling

In compiler terminology, loop unrolling is defined as unrolling the innermost loop; this is the narrow-sense definition. Computational scientists also use the term for unrolling multiple loops: the broad-sense definition, which compiler optimization classifies as loop restructuring.

Page 29:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

k-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k+=2)
      C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];

The number of loop-counter checks in the k-loop is halved.
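All the unrolling examples on these pages assume n is divisible by the unrolling depth. When it is not, the standard fix (not shown on the slides) is a remainder (cleanup) loop after the unrolled body; a minimal sketch for the k-loop case:

for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    int k;
    for (k=0; k<n-1; k+=2)   /* unrolled main body: pairs of iterations */
      C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];
    for (; k<n; k++)         /* cleanup loop: at most one leftover iteration */
      C[i][j] += A[i][k] * B[k][j];
  }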


Page 30:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

j-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i++)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i][j  ] += A[i][k] * B[k][j  ];
      C[i][j+1] += A[i][k] * B[k][j+1];
    }

By keeping A[i][k] in a register, it can be accessed at high speed.

Page 31:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

i-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++) {
      C[i  ][j] += A[i  ][k] * B[k][j];
      C[i+1][j] += A[i+1][k] * B[k][j];
    }

By keeping B[k][j] in a register, it can be accessed at high speed.

Page 32:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i  ][j  ] += A[i  ][k] * B[k][j  ];
      C[i  ][j+1] += A[i  ][k] * B[k][j+1];
      C[i+1][j  ] += A[i+1][k] * B[k][j  ];
      C[i+1][j+1] += A[i+1][k] * B[k][j+1];
    }

By keeping A[i][k], A[i+1][k], B[k][j], and B[k][j+1] in registers, they can be accessed at high speed.

Page 33:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

To make this easier for compilers to understand, the following code is better in some situations:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2) {
    dc00 = C[i  ][j  ]; dc01 = C[i  ][j+1];
    dc10 = C[i+1][j  ]; dc11 = C[i+1][j+1];
    for (k=0; k<n; k++) {
      da0 = A[i  ][k]; da1 = A[i+1][k];
      db0 = B[k][j  ]; db1 = B[k][j+1];
      dc00 += da0 * db0; dc01 += da0 * db1;
      dc10 += da1 * db0; dc11 += da1 * db1;
    }
    C[i  ][j  ] = dc00; C[i  ][j+1] = dc01;
    C[i+1][j  ] = dc10; C[i+1][j+1] = dc11;
  }

Page 34:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

k-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n
  do j=1, n
    do k=1, n, 2
      C(i, j) = C(i, j) + A(i, k)*B(k, j) + A(i, k+1)*B(k+1, j)
    enddo
  enddo
enddo

The number of loop-counter checks in the k-loop is halved.

Page 35:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

j-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n
  do j=1, n, 2
    do k=1, n
      C(i, j  ) = C(i, j  ) + A(i, k) * B(k, j  )
      C(i, j+1) = C(i, j+1) + A(i, k) * B(k, j+1)
    enddo
  enddo
enddo

By keeping A(i, k) in a register, it can be accessed at high speed.

Page 36:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

i-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n, 2
  do j=1, n
    do k=1, n
      C(i  , j) = C(i  , j) + A(i  , k) * B(k, j)
      C(i+1, j) = C(i+1, j) + A(i+1, k) * B(k, j)
    enddo
  enddo
enddo

By keeping B(k, j) in a register, it can be accessed at high speed.

Page 37:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n, 2
  do j=1, n, 2
    do k=1, n
      C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
      C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
      C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
      C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
    enddo
  enddo
enddo

By keeping A(i, k), A(i+1, k), B(k, j), and B(k, j+1) in registers, they can be accessed at high speed.

Page 38:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

To make this easier for compilers to understand, the following code is better in some situations:

do i=1, n, 2
  do j=1, n, 2
    dc00 = C(i  , j); dc01 = C(i  , j+1)
    dc10 = C(i+1, j); dc11 = C(i+1, j+1)
    do k=1, n
      da0 = A(i  , k); da1 = A(i+1, k)
      db0 = B(k, j  ); db1 = B(k, j+1)
      dc00 = dc00 + da0*db0; dc01 = dc01 + da0*db1
      dc10 = dc10 + da1*db0; dc11 = dc11 + da1*db1
    enddo
    C(i  , j  ) = dc00; C(i  , j+1) = dc01
    C(i+1, j  ) = dc10; C(i+1, j+1) = dc11
  enddo
enddo

Page 39:

Contiguous Access to Arrays

Strided access gives poor performance.

Page 40:

Storage Formats for Arrays

Figure: storage order of a 4×4 array holding elements 1-16.
C language, A[i][j]: elements are stored row by row; the direction of contiguous access is along j (the rightmost subscript).
Fortran language, A(i, j): elements are stored column by column; the direction of contiguous access is along i (the leftmost subscript).
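The practical consequence in C: make the rightmost subscript the innermost loop. A minimal sketch (the array size is illustrative); both functions compute the same sum, but the first walks memory in storage order while the second strides by a full row on every access:

#define N 1000
double A[N][N];

/* Contiguous: the inner loop visits A[i][0], A[i][1], ... in memory order. */
double sum_rowwise(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += A[i][j];
    return s;
}

/* Strided: the inner loop jumps N*sizeof(double) bytes per iteration. */
double sum_columnwise(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];
    return s;
}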

Page 41:

Cache and Cache Lines

Data-mapping methods between main memory and cache:

From main memory to cache:
  Direct Mapping: each memory unit (bank) maps directly to a fixed line.
  Set Associative: indirect mapping through a hash function.

From cache to main memory:
  Store Through: content is written to main memory as soon as the data is written in the cache.
  Store In: content is written to main memory only when the data on the cache line is replaced.

Figure: main memory divided into banks, mapped by a mapping function onto cache lines (Line 0 - Line 5).

Page 42:

Cache Line Conflict

Consider direct mapping, where main-memory addresses map directly onto cache lines; the mapping stride is 4 in this example, so elements 4 apart in main memory map to the same cache line.

In this example the access runs opposite to the direction of contiguous storage: with C, that is access along the i direction.

Figure: a 4×4 array (elements 1-16) in main memory and a cache with 4 lines (Line 0 - Line 3); contiguous storage runs along rows, while the access walks down a column, so elements 1, 5, 9, 13 all map to the same line.

Page 43:

Cache Line Conflict

1. In this situation, element 5 is accessed as soon as element 1 has been stored into cache line 0, so element 1 must leave line 0 immediately.
2. Likewise, element 9 is accessed as soon as element 5 has been stored into cache line 0, so element 5 must leave immediately.

Figure: elements 1, 5, and 9 pass to the registers one after another through the same cache line 0.

Page 44:

Cache Line Conflict

Steps 1 and 2 happen continuously in this example. The data-transfer path is kept full by the movement from main memory to cache, so no additional data can be moved. This is the same as sequential access to main memory, i.e., the same situation as having no cache at all. Data cannot reach the arithmetic unit when it is needed, computation stalls, and performance drops.

This phenomenon is called <Cache Thrashing> or <Cache Line Conflict>.

Page 45:

Memory Interleaving

Consider physically direct access to an array. When data is accessed, accesses to banks near the currently accessed bank are very likely. Exploiting this, hardware can move data from nearby banks to cache lines in advance: while data on line 0 is being accessed, data can be transferred in parallel from a nearby bank to line 1. This is called <Memory Interleaving>.

From the viewpoint of the arithmetic unit (AU), data-access time is shortened; the AU's idle time shrinks, which raises efficiency.

Make the loops in your program access data contiguously, in storage order!

Page 46:

Conditions for Cache Line Conflict

Memory banks are usually assigned to cache lines in powers of 2, for example 32, 64, or 128. If performance drops by a factor of 1/2 to 1/3, sometimes 1/10, at particular array sizes such as 1024, a cache line conflict may be happening. Real programs behave in very complex ways, so the exact conditions for cache line conflict are hard to identify. But: avoid allocating arrays whose sizes are powers of 2.

Page 47:

How to Avoid Cache Line Conflict

To avoid cache line conflict, we can apply the following (a padding sketch follows this list):
1. Padding: allocate extra array elements so that the array size is not a power of 2, then use only part of the allocated array. Some compilers have an option for padding.
2. Data Compression: allocate a new array for the computation and move the data into the newly allocated array.
3. Conflict Prediction: include in the program a routine that predicts cache line conflict; when an array is allocated, call the routine to check for conflict.
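A minimal sketch of padding (technique 1) in C; the pad of 8 elements is illustrative, not a tuned value. The padded leading dimension keeps successive rows from mapping to the same cache lines, so column-wise (strided) access no longer thrashes:

#define N   1024   /* power-of-2 leading dimension would be conflict-prone */
#define PAD 8      /* illustrative pad; tune for the target cache */

/* Allocate N x (N+PAD) but use only the first N columns of each row. */
static double A[N][N + PAD];

void scale_column(int j, double s) {
    for (int i = 0; i < N; i++)    /* strided (column-direction) access */
        A[i][j] *= s;
}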


Page 48:

Blocking

Reuse data within a small access area.

Page 49:

Access Localization by Blocking

The cache has a finite size: even with contiguous access, data is evicted from the cache once the access area exceeds the cache size. If data is evicted frequently, the situation is the same as always fetching from main memory, and the speed advantage of the cache is lost.

Hence, for high performance we need to:
1. fill the cache lines with useful data;
2. reuse the data on those cache lines many times.

Page 50:

An Example of Cache Misses and Blocking

Matrix-matrix multiplication under a toy cache model:
  Matrix size: 8×8 (double A[8][8]).
  Number of cache lines: 4.
  Each cache line holds 4 array elements: 4 × 8 Bytes (double) = 32 Bytes per line.
  Row-wise access is contiguous (C language).
  Cache replacement algorithm: Least Recently Used (LRU).
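The miss counts on the following pages can be explored with a small simulator of this toy cache. This is a sketch, not from the slides, assuming a fully associative 4-line LRU cache, 4 doubles per line, and the plain ijk access order (A[i][k], B[k][j], then C[i][j] per innermost iteration); exact counts depend on the access order assumed, so treat the output as illustrative:

#include <stdio.h>

#define LINES 4        /* toy cache: 4 lines */
#define ELEMS 4        /* 4 doubles per line */
#define N 8            /* 8x8 matrices */

static long cache[LINES];  /* cached line ids, most recently used first */
static int  used;
static long misses;

/* Touch one line id under LRU replacement; count misses. */
static void touch(long id) {
    int pos = -1;
    for (int p = 0; p < used; p++)
        if (cache[p] == id) { pos = p; break; }
    if (pos < 0) {                     /* miss: take the LRU (last) slot */
        misses++;
        if (used < LINES) used++;
        pos = used - 1;
    }
    for (int p = pos; p > 0; p--)      /* move to front (most recent) */
        cache[p] = cache[p - 1];
    cache[0] = id;
}

/* Line id of element (i,j) of array number a (0=C, 1=A, 2=B). */
static long line_id(int a, int i, int j) {
    return a * (N * N / ELEMS) + (i * N + j) / ELEMS;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                touch(line_id(1, i, k));   /* load A[i][k] */
                touch(line_id(2, k, j));   /* load B[k][j] */
                touch(line_id(0, i, j));   /* update C[i][j] */
            }
    printf("cache misses, non-blocked ijk order: %ld\n", misses);
    return 0;
}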

Page 51:

Relationship between the Array Layout and Cache Lines

Under these assumptions, the relationship between the arrays and the cache lines is as follows; cache line conflicts are not considered.

In C, for the arrays A[i][j], B[i][j], C[i][j]: storage runs along j, so each cache line holds a 1×4 piece of a row (elements 1-4, 5-8, ..., 61-64 of the 8×8 array).

Which line a piece occupies is determined by the <access pattern of the arrays> and the <cache replacement algorithm>.

Page 52:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: computing with C, A, and B under the toy cache (4 lines; LRU: the data on the line least recently accessed is evicted). The first accesses cause cache misses (1) through (5).

Page 53:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: continuing the computation, cache misses (6) through (11) occur (4 lines, LRU).

Page 54:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: the misses keep accumulating; the number of cache misses is 22 while computing 2 elements of C (4 lines, LRU).

Page 55:

Matrix-Matrix Multiplication (Blocking: 2 Elements)

Figure: computing with a 2-element blocking unit of C (blocks (1) and (2)); the first pass incurs 6 cache misses (4 lines, LRU).

Page 56:

Matrix-Matrix Multiplication (Blocking: 2 Elements)

Figure: continuing with blocks (3) and (4). In total, the number of cache misses is 10 while computing 2 elements of C (4 lines, LRU), versus 22 without blocking.

Page 57:

Matrix-Matrix Multiplication Code (C Language): Without Blocking

Code example:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++)
      C[i][j] += A[i][k] * B[k][j];

Figure: C is indexed by (i, j), A by (i, k), and B by (k, j).

Page 58:

Matrix-Matrix Multiplication Code (C Language): With Blocking

When n is divisible by the blocking size (ibl = 16), blocking gives the following six-nested loop (a sketch for general n follows the code):

ibl = 16;
for (ib=0; ib<n; ib+=ibl) {
  for (jb=0; jb<n; jb+=ibl) {
    for (kb=0; kb<n; kb+=ibl) {
      for (i=ib; i<ib+ibl; i++) {
        for (j=jb; j<jb+ibl; j++) {
          for (k=kb; k<kb+ibl; k++) {
            C[i][j] += A[i][k] * B[k][j];
} } } } } }
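When n is not divisible by ibl, one common fix (a sketch, not from the slide) is to clip every block boundary with a min(), so no divisibility assumption is needed:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Blocked matrix-matrix multiply for arbitrary n; matrices are stored
   row-major in flat arrays of length n*n. Each block is clipped at the
   matrix edge by MIN(). */
void matmul_blocked(int n, double *C, const double *A, const double *B) {
    const int ibl = 16;   /* blocking size, as on the slide */
    for (int ib = 0; ib < n; ib += ibl)
        for (int jb = 0; jb < n; jb += ibl)
            for (int kb = 0; kb < n; kb += ibl)
                for (int i = ib; i < MIN(ib + ibl, n); i++)
                    for (int j = jb; j < MIN(jb + ibl, n); j++)
                        for (int k = kb; k < MIN(kb + ibl, n); k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}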

Page 59:

Matrix-Matrix Multiplication Code (Fortran Language): With Blocking

When n is divisible by the blocking size (ibl = 16), blocking gives the following six-nested loop:

ibl = 16
do ib=1, n, ibl
  do jb=1, n, ibl
    do kb=1, n, ibl
      do i=ib, ib+ibl-1
        do j=jb, jb+ibl-1
          do k=kb, kb+ibl-1
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
enddo; enddo; enddo; enddo; enddo; enddo

Page 60:

Data Access Pattern with Cache Blocking

Figure: C = A × B computed block by block; matrix-matrix multiplication is performed on small ibl × ibl blocks.

Page 61:

Data Access Pattern with Cache Blocking

Figure: the ibl × ibl block of C moves on to the next position; each step again multiplies small ibl × ibl blocks of A and B.

Page 62:

Unrolling for Blocked Matrix-Matrix Multiplication (C Language)

Unrolling can also be applied to the six-nested loop of the blocked matrix-matrix multiplication. Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

ibl = 16;
for (ib=0; ib<n; ib+=ibl) {
  for (jb=0; jb<n; jb+=ibl) {
    for (kb=0; kb<n; kb+=ibl) {
      for (i=ib; i<ib+ibl; i+=2) {
        for (j=jb; j<jb+ibl; j+=2) {
          for (k=kb; k<kb+ibl; k++) {
            C[i  ][j  ] += A[i  ][k] * B[k][j  ];
            C[i+1][j  ] += A[i+1][k] * B[k][j  ];
            C[i  ][j+1] += A[i  ][k] * B[k][j+1];
            C[i+1][j+1] += A[i+1][k] * B[k][j+1];
} } } } } }

Page 63:

Unrolling for Blocked Matrix-Matrix Multiplication (Fortran Language)

Unrolling can also be applied to the six-nested loop of the blocked matrix-matrix multiplication. Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

ibl = 16
do ib=1, n, ibl
  do jb=1, n, ibl
    do kb=1, n, ibl
      do i=ib, ib+ibl-1, 2
        do j=jb, jb+ibl-1, 2
          do k=kb, kb+ibl-1
            C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
            C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
            C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
            C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
enddo; enddo; enddo; enddo; enddo; enddo

Page 64:

Other Techniques to Establish Speedups

Page 65:

Delete Common Statements (1)

There is redundant computation in the following program:

d = a + b + c;
f = d + a + b;

Some compilers optimize this away, but the following is better:

temp = a + b;
d = temp + c;
f = d + temp;

Page 66:

Delete Common Statements (2)

It is also better to avoid redundant array accesses:

for (i=0; i<n; i++) {
  xold[i] = x[i];
  x[i] = x[i] + y[i];
}

The following is one way to do it:

for (i=0; i<n; i++) {
  dtemp = x[i];
  xold[i] = dtemp;
  x[i] = dtemp + y[i];
}

Page 67:

Movement of Code

Division takes a long time; do not write it inside a loop if possible:

for (i=0; i<n; i++) {
  a[i] = a[i] / sqrt(dnorm);
}

In a case like this, replace the division with a multiplication:

dtemp = 1.0 / sqrt(dnorm);
for (i=0; i<n; i++) {
  a[i] = a[i] * dtemp;
}

Page 68:

IF Statements Inside Loops

Avoid if statements inside loops as much as possible:

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    if ( i != j ) A[i][j] = B[i][j];
    else A[i][j] = 1.0;
  }
}

In this case, the following is better:

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    A[i][j] = B[i][j];
  }
}
for (i=0; i<n; i++) A[i][i] = 1.0;

Page 69:

Strengthening Software Pipelining

Original code (depth-2 unrolling):

for (i=0; i<n; i+=2) {
  dtmpb0 = b[i];
  dtmpc0 = c[i];
  dtmpa0 = dtmpb0 + dtmpc0;
  a[i] = dtmpa0;
  dtmpb1 = b[i+1];
  dtmpc1 = c[i+1];
  dtmpa1 = dtmpb1 + dtmpc1;
  a[i+1] = dtmpa1;
}

Here the distance between definition and use is short, so software pipelining has nothing to work with.

Code strengthened for software pipelining (depth-2 unrolling):

for (i=0; i<n; i+=2) {
  dtmpb0 = b[i];
  dtmpb1 = b[i+1];
  dtmpc0 = c[i];
  dtmpc1 = c[i+1];
  dtmpa0 = dtmpb0 + dtmpc0;
  dtmpa1 = dtmpb1 + dtmpc1;
  a[i] = dtmpa0;
  a[i+1] = dtmpa1;
}

Here the distance between definition and use is large, creating many opportunities for software pipelining.