
Page 1:

Introduction to High Performance Programming

名古屋大学情報基盤中心 教授 片桐孝洋

Takahiro Katagiri, Professor, Information Technology Center, Nagoya University

Introduction to Parallel Programming for Multicore/Manycore Clusters

國家理論中心數學組「高效能計算」短期課程 (National Center for Theoretical Sciences, Mathematics Division: Short Course on High-Performance Computing)

Page 2:

Agenda
1. Hierarchical Cache Memories
2. Instruction Pipeline
3. Loop Unrolling
4. Contiguous Access to Arrays
5. Blocking
6. Other Techniques to Establish Speedups

Page 3:

Hierarchical Cache Memories

The fastest memories are also the smallest.

Page 4:

Memory Hierarchies on Recent Computers

From high speed (small) to large amount (slow):

  Registers     O(1 nanosecond)       Bytes
  Caches        O(10 nanoseconds)     KBytes - MBytes
  Main Memory   O(100 nanoseconds)    MBytes - GBytes
  Hard Disk     O(10 milliseconds)    GBytes - TBytes

The cost of moving data from main memory to registers is O(100) times the access cost of a register.

Page 5:

More Intuitive Explanations…

Figure: a register sits within a cache, which sits within main memory; each level is a small window onto the next.

For high-performance programming, we should modify programs so that they work within a very small part of their access area at a time. (This is called "spatial locality.")
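To make locality concrete, here is a minimal, self-contained C sketch (not from the slides; the array size and the use of rand() are illustrative). Both loops touch the same N elements exactly once; only the access order differs, and on cached machines the sequential version runs far faster because consecutive elements share cache lines:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)

/* Sequential pass: consecutive addresses share cache lines
   (good spatial locality), so most accesses hit in the cache. */
double sum_sequential(const double *x) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += x[i];
    return s;
}

/* Scattered pass: the same elements in shuffled order; almost
   every access touches a different cache line. */
double sum_scattered(const double *x, const size_t *idx) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++) s += x[idx[i]];
    return s;
}

int main(void) {
    double *x   = malloc(N * sizeof *x);
    size_t *idx = malloc(N * sizeof *idx);
    for (size_t i = 0; i < N; i++) { x[i] = 1.0; idx[i] = i; }
    for (size_t i = N - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);  /* biased, fine for a demo */
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    printf("%f %f\n", sum_sequential(x), sum_scattered(x, idx));
    free(x); free(idx);
    return 0;
}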

Page 6:

An Example of Cache Organization

Figure: the CPU contains registers and the arithmetic unit; the arithmetic unit issues operation requests and returns computation results, while data is supplied from and stored to the cache memory, which in turn exchanges data with main memory. Main memory is divided into banks (blocks); the cache holds a subset of those blocks. Cache directories map the upper part of a physical address to a cache set, while the lower part addresses data inside a block.

Page 7:

A Memory Organization of the Fujitsu FX10 (a K computer-type supercomputer)

Figure: the FX10 memory hierarchy, from high speed to large amount:

  Registers
  Level 1 Cache (32 KBytes / core)
  Level 2 Cache (12 MBytes / 16 cores)
  Main Memory (32 GBytes / node)

Data can reside at each of these levels.

Page 8:

A Memory Organization of the Fujitsu FX10 (a K computer-type supercomputer), continued

High-speed access is achieved if the data is on the L1 cache. (The figure repeats the hierarchy of the previous page, with the data now resident in the L1 cache.)

Page 9:

An Example of Memory Organization for Each Node on the FX10

This is a hierarchical memory organization.

Figure: one node. Sixteen cores (Core 0 - Core 15), each with its own L1 cache, share a single L2 cache, which connects to main memory.

Page 10:

An Example of Memory Organization for the Whole System on the FX10

Memory is hierarchical.

Figure: many such nodes (each with 16 cores, per-core L1 caches, a shared L2 cache, and its own main memory) are connected by the TOFU network (5 GBytes/s × both ways).

Page 11:

Node Organization of the FX10

Figure: there is only one socket per node. Inside the CPU, sixteen cores (Core #0 - Core #15), each with a 32 KB L1 data cache, share a 12 MB L2 cache. The L2 connects to DDR3 DIMM memory at 85 GB/s (8 Bytes × 1333 MHz × 8 channels); the total memory amount is 8 GB × 4 = 32 GB (4 GB × 2 channels, times four). The node connects to the TOFU network through the ICC at 20 GB/s.

Page 12:

Specification of the CPU for the FX10 (SPARC64 IXfx)

  Architecture Name           HPC-ACE (Extended SPARC-V9 Instruction Set)
  Frequency                   1.848 GHz
  L1 Cache                    32 KBytes (instruction and data are separate)
  L2 Cache                    12 MBytes
  Software-Controlled Cache   Sector cache
  Instruction Issue           2 integer operation units; 4 floating-point multiply-and-add (FMA) units
  SIMD Instruction            2 FMAs work per instruction; each executes 2 floating-point operations (add and multiply)
  Registers                   256 registers for floating-point computation
  Others                      Instructions for sin and cos, for conditional execution, and for division and sqrt

Page 13:

Node Organization of the FX100

Figure: two sockets (CMGs: Core Memory Groups) per node in a NUMA (Non-Uniform Memory Access) configuration. Each CMG holds 16 compute cores (e.g. Core #0 - Core #15) plus one assistant core, each with a 64 KB L1 data cache; all 17 cores share a 12 MB L2 cache, backed by a 16 GB HMC memory per CMG. Total memory amount per node: 32 GB. Memory bandwidth: reading 240 GB/s + writing 240 GB/s = 480 GB/s in total. The node connects to the TOFU2 network through the ICC.

Page 14:

Architectural Comparison between the FX10 and the FX100

source: https://www.ssken.gr.jp/MAINSITE/event/2015/20151028-sci/lecture-04/SSKEN_sci2015_miyoshi_presentation.pdf

  Item                          FX10                             FX100
  Computation Capacity / Node   236 GFLOPS (double and single)   Double: 1.011 TFLOPS; Single: 2.022 TFLOPS
  Number of Cores               16                               32
  Assistant Cores               None                             2
  SIMD Length                   128 bits                         256 bits
  SIMD Operations               Floating-point operations;       Those of the FX10, plus integer operations
                                contiguous load/store            and strided and indirect load/store
  L1D Cache / core              32 KB, 2-way                     64 KB, 4-way
  L2 Cache / node               12 MB                            24 MB
  Memory Bandwidth              85 GB/s                          480 GB/s (HMC)

Page 15:

Specification of the FX100 CPU (SPARC64 XIfx, Nagoya U.)

  Architecture Name           HPC-ACE2 (Extended SPARC-V9 Instruction Set)
  Frequency                   2.2 GHz
  L1 Cache                    64 KBytes (instruction and data are separate)
  L2 Cache                    24 MBytes
  Software-Controlled Cache   Sector cache
  Instruction Issue           2 integer operation units; 8 floating-point multiply-and-add (FMA) units
  SIMD Instruction            2 FMAs work per instruction; each executes 4 floating-point operations (add and multiply)
  Registers                   256 registers for floating-point computation

Page 16:

Instruction Pipeline

An assembly line for computations.

Page 17:

Assembly-Line Work

For example, assembling cars, where one worker owns one step (5 workers): making the car body; putting on the front and rear windows; interior finishing; exterior finishing; confirming functions.

Suppose the whole procedure takes 2 months, i.e., each step takes 0.4 month. After 2 months, the first car is made; after 4 months, the second car is made. Efficiency: 2 months per car.

Figure: a timeline of the first, second, and third cars, each passing through the five steps one after another, with no overlap.

Each worker works for 0.4 month and then rests for 1.6 months: very bad efficiency.

Page 18:

Assembly-Line Work

Instead, set up 5 work stations. As soon as a car finishes one step, it moves on to the next step: a belt-conveyor system.

Figure: the five steps (making the car body; putting on the front and rear windows; interior finishing; exterior finishing; confirming functions), each taking 0.4 month.

Page 19:

Assembly-Line Work

In this method, we can make:
  the first car in 2 months,
  the second car in 2.4 months,
  the third car in 2.8 months,
  the fourth car in 3.2 months,
  the fifth car in 3.6 months,
  the sixth car in 4.0 months.
Efficiency: about 0.67 month per car so far, approaching 0.4 month per car as production continues.

Figure: a timeline of the first through fifth cars, their five steps overlapping.

Each worker keeps working every 0.4 month once enough time has passed: very high efficiency.

This scheme is called a <Pipeline Process>.

Page 20:

Pipeline Processing on Computers

1. Hardware pipelining: performed by the computer hardware. Typical uses:
   1. computation with pipelining;
   2. sending data (operation codes, data) from memory with pipelining.
2. Software pipelining: performed by how the program is written. Typical uses:
   1. pipelining by the compiler (instruction preload, data prefetch, data post-store);
   2. pipelining by hand modification of the program (data preload, loop unrolling).

Page 21:

In the Case of an Arithmetic Unit (AU)

Example process for an AU. (Note: the process in a real AU differs from the following.)

Example, matrix-vector multiplication:

for (j=0; j<n; j++)
  for (i=0; i<n; i++) {
    y[j] += A[j][i] * x[i];
  }

If we do not use pipelining, the process looks as follows: for each operation in turn, the AU fetches data A from memory, fetches data B from memory, issues the computation, and stores the result before starting the next operation.

Page 22:

In the Case of an Arithmetic Unit (AU)

Without pipelining, the AU is busy for only one time unit out of every four, hence poor efficiency: computation efficiency is 1/4 = 25%. If we pipeline as follows, a computation completes in every time unit once enough time has passed: computation efficiency approaches 100%.

Figure: the four stages (fetch data A from memory, fetch data B from memory, issue computation, store result) of successive operations overlap in time.

"Enough time" means a long enough loop. A large loop length N keeps the pipeline free of stalls and hence highly efficient; if N is small, efficiency degrades.

Page 23:

Summary of Computation Pipelining

A concept for achieving full usage of the AU, i.e., high-performance computing. Fetching data from main memory is very slow; by scheduling load instructions in a pipeline, the fetch time can be hidden, so that the AU does useful work in every time unit.

In real execution this is not easy, because of the following issues:
1. Delays caused by the computer architecture, such as the limited number of registers and the limited data-supply bandwidth from memory to CPU and from CPU to memory. (The CPU of the FX100 is based on SPARC64.)
2. Loop overhead, such as initializing and incrementing loop induction variables (i, j) and evaluating the loop-exit condition.
3. Computing the memory addresses used to access array data.
4. Whether the compiler generates pipelined code or not.

Page 24:

Notes on Actual CPUs

Actual CPUs have independent pipelines for:
1. addition and subtraction, and
2. multiplication.
Moreover, there are several instructions for simultaneous pipelined operations.

In the Intel Pentium 4, the pipeline depth is 31! The time needed to fill such a pipeline is very large, and if there are many branches, instruction-prediction misses occur and efficiency degrades.

Recent multicore and manycore CPUs need only a small pipeline depth because of their low frequency; the Xeon Phi has a pipeline depth of 7.

Page 25:

Hardware Information for the FX10

8 floating-point operations per clock can be performed per core: each FMA instruction performs 2 multiply-and-add pairs (4 floating-point operations), and 2 FMAs work per clock, so 4 × 2 = 8 floating-point operations per clock.

At 1.848 GHz per core, the theoretical peak performance is:
  1.848 GHz × 8 = 14.784 GFLOPS / core
  1 node (16 cores): 14.784 × 16 = 236.5 GFLOPS / node

Registers for floating-point operations: 256 registers / core.

Page 26:

Loop Unrolling

Compilers do not always perform this automatically.

Page 27:

What is "Loop Unrolling"?

It is a tuning technique, also performed by compilers, that rewrites code to enhance:
1. register allocation for data;
2. pipelining.
It changes the loop stride from 1 to m: <depth-m unrolling>.

Page 28:

Loop Unrolling

In compiler terminology, loop unrolling is defined as unrolling the innermost loop; this is the narrow-sense definition. Computational scientists also use the term for unrolling multiple loops: the broad-sense definition, which compiler optimization classifies as loop restructuring.

Page 29:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

k-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k+=2)
      C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];

The number of loop-counter checks in the k-loop is halved.
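All the unrolling examples on these pages assume n is divisible by the unrolling depth. When it is not, the standard fix (not shown on the slides) is a remainder (cleanup) loop after the unrolled body; a minimal sketch for the k-loop case:

for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    int k;
    for (k=0; k<n-1; k+=2)   /* unrolled main body: pairs of iterations */
      C[i][j] += A[i][k] * B[k][j] + A[i][k+1] * B[k+1][j];
    for (; k<n; k++)         /* cleanup loop: at most one leftover iteration */
      C[i][j] += A[i][k] * B[k][j];
  }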


Page 30:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

j-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i++)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i][j  ] += A[i][k] * B[k][j  ];
      C[i][j+1] += A[i][k] * B[k][j+1];
    }

By keeping A[i][k] in a register, it can be accessed at high speed.

Page 31:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

i-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++) {
      C[i  ][j] += A[i  ][k] * B[k][j];
      C[i+1][j] += A[i+1][k] * B[k][j];
    }

By keeping B[k][j] in a register, it can be accessed at high speed.

Page 32:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2)
    for (k=0; k<n; k++) {
      C[i  ][j  ] += A[i  ][k] * B[k][j  ];
      C[i  ][j+1] += A[i  ][k] * B[k][j+1];
      C[i+1][j  ] += A[i+1][k] * B[k][j  ];
      C[i+1][j+1] += A[i+1][k] * B[k][j+1];
    }

By keeping A[i][k], A[i+1][k], B[k][j], and B[k][j+1] in registers, they can be accessed at high speed.

Page 33:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, C Language)

To make this easier for compilers to understand, the following code is better in some situations:

for (i=0; i<n; i+=2)
  for (j=0; j<n; j+=2) {
    dc00 = C[i  ][j  ]; dc01 = C[i  ][j+1];
    dc10 = C[i+1][j  ]; dc11 = C[i+1][j+1];
    for (k=0; k<n; k++) {
      da0 = A[i  ][k]; da1 = A[i+1][k];
      db0 = B[k][j  ]; db1 = B[k][j+1];
      dc00 += da0 * db0; dc01 += da0 * db1;
      dc10 += da1 * db0; dc11 += da1 * db1;
    }
    C[i  ][j  ] = dc00; C[i  ][j+1] = dc01;
    C[i+1][j  ] = dc10; C[i+1][j+1] = dc11;
  }

Page 34:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

k-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n
  do j=1, n
    do k=1, n, 2
      C(i, j) = C(i, j) + A(i, k)*B(k, j) + A(i, k+1)*B(k+1, j)
    enddo
  enddo
enddo

The number of loop-counter checks in the k-loop is halved.

Page 35:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

j-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n
  do j=1, n, 2
    do k=1, n
      C(i, j  ) = C(i, j  ) + A(i, k) * B(k, j  )
      C(i, j+1) = C(i, j+1) + A(i, k) * B(k, j+1)
    enddo
  enddo
enddo

By keeping A(i, k) in a register, it can be accessed at high speed.

Page 36:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

i-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n, 2
  do j=1, n
    do k=1, n
      C(i  , j) = C(i  , j) + A(i  , k) * B(k, j)
      C(i+1, j) = C(i+1, j) + A(i+1, k) * B(k, j)
    enddo
  enddo
enddo

By keeping B(k, j) in a register, it can be accessed at high speed.

Page 37:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

i-loop and j-loop depth-2 unrolling, where n is divisible by 2:

do i=1, n, 2
  do j=1, n, 2
    do k=1, n
      C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
      C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
      C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
      C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
    enddo
  enddo
enddo

By keeping A(i, k), A(i+1, k), B(k, j), and B(k, j+1) in registers, they can be accessed at high speed.

Page 38:

An Example of Loop Unrolling (Matrix-Matrix Multiplication, Fortran Language)

To make this easier for compilers to understand, the following code is better in some situations:

do i=1, n, 2
  do j=1, n, 2
    dc00 = C(i  , j); dc01 = C(i  , j+1)
    dc10 = C(i+1, j); dc11 = C(i+1, j+1)
    do k=1, n
      da0 = A(i  , k); da1 = A(i+1, k)
      db0 = B(k, j  ); db1 = B(k, j+1)
      dc00 = dc00 + da0*db0; dc01 = dc01 + da0*db1
      dc10 = dc10 + da1*db0; dc11 = dc11 + da1*db1
    enddo
    C(i  , j  ) = dc00; C(i  , j+1) = dc01
    C(i+1, j  ) = dc10; C(i+1, j+1) = dc11
  enddo
enddo

Page 39:

Contiguous Access to Arrays

Strided access gives poor performance.

Page 40:

Storage Formats for Arrays

Figure: storage order of a 4×4 array holding elements 1-16.
C language, A[i][j]: elements are stored row by row; the direction of contiguous access is along j (the rightmost subscript).
Fortran language, A(i, j): elements are stored column by column; the direction of contiguous access is along i (the leftmost subscript).
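The practical consequence in C: make the rightmost subscript the innermost loop. A minimal sketch (the array size is illustrative); both functions compute the same sum, but the first walks memory in storage order while the second strides by a full row on every access:

#define N 1000
double A[N][N];

/* Contiguous: the inner loop visits A[i][0], A[i][1], ... in memory order. */
double sum_rowwise(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += A[i][j];
    return s;
}

/* Strided: the inner loop jumps N*sizeof(double) bytes per iteration. */
double sum_columnwise(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += A[i][j];
    return s;
}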

Page 41:

Cache and Cache Lines

Data-mapping methods between main memory and cache:

From main memory to cache:
  Direct Mapping: each memory unit (bank) maps directly to a fixed line.
  Set Associative: indirect mapping through a hash function.

From cache to main memory:
  Store Through: content is written to main memory as soon as the data is written in the cache.
  Store In: content is written to main memory only when the data on the cache line is replaced.

Figure: main memory divided into banks, mapped by a mapping function onto cache lines (Line 0 - Line 5).

Page 42:

Cache Line Conflict

Consider direct mapping, where main-memory addresses map directly onto cache lines; the mapping stride is 4 in this example, so elements 4 apart in main memory map to the same cache line.

In this example the access runs opposite to the direction of contiguous storage: with C, that is access along the i direction.

Figure: a 4×4 array (elements 1-16) in main memory and a cache with 4 lines (Line 0 - Line 3); contiguous storage runs along rows, while the access walks down a column, so elements 1, 5, 9, 13 all map to the same line.

Page 43:

Cache Line Conflict

1. In this situation, element 5 is accessed as soon as element 1 has been stored into cache line 0, so element 1 must leave line 0 immediately.
2. Likewise, element 9 is accessed as soon as element 5 has been stored into cache line 0, so element 5 must leave immediately.

Figure: elements 1, 5, and 9 pass to the registers one after another through the same cache line 0.

Page 44:

Cache Line Conflict

Steps 1 and 2 happen continuously in this example. The data-transfer path is kept full by the movement from main memory to cache, so no additional data can be moved. This is the same as sequential access to main memory, i.e., the same situation as having no cache at all. Data cannot reach the arithmetic unit when it is needed, computation stalls, and performance drops.

This phenomenon is called <Cache Thrashing> or <Cache Line Conflict>.

Page 45:

Memory Interleaving

Consider physically direct access to an array. When data is accessed, accesses to banks near the currently accessed bank are very likely. Exploiting this, hardware can move data from nearby banks to cache lines in advance: while data on line 0 is being accessed, data can be transferred in parallel from a nearby bank to line 1. This is called <Memory Interleaving>.

From the viewpoint of the arithmetic unit (AU), data-access time is shortened; the AU's idle time shrinks, which raises efficiency.

Make the loops in your program access data contiguously, in storage order!

Page 46:

Conditions for Cache Line Conflict

Memory banks are usually assigned to cache lines in powers of 2, for example 32, 64, or 128. If performance drops by a factor of 1/2 to 1/3, sometimes 1/10, at particular array sizes such as 1024, a cache line conflict may be happening. Real programs behave in very complex ways, so the exact conditions for cache line conflict are hard to identify. But: avoid allocating arrays whose sizes are powers of 2.

Page 47:

How to Avoid Cache Line Conflict

To avoid cache line conflict, we can apply the following (a padding sketch follows this list):
1. Padding: allocate extra array elements so that the array size is not a power of 2, then use only part of the allocated array. Some compilers have an option for padding.
2. Data Compression: allocate a new array for the computation and move the data into the newly allocated array.
3. Conflict Prediction: include in the program a routine that predicts cache line conflict; when an array is allocated, call the routine to check for conflict.
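A minimal sketch of padding (technique 1) in C; the pad of 8 elements is illustrative, not a tuned value. The padded leading dimension keeps successive rows from mapping to the same cache lines, so column-wise (strided) access no longer thrashes:

#define N   1024   /* power-of-2 leading dimension would be conflict-prone */
#define PAD 8      /* illustrative pad; tune for the target cache */

/* Allocate N x (N+PAD) but use only the first N columns of each row. */
static double A[N][N + PAD];

void scale_column(int j, double s) {
    for (int i = 0; i < N; i++)    /* strided (column-direction) access */
        A[i][j] *= s;
}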


Page 48:

Blocking

Reuse data within a small access area.

Page 49:

Access Localization by Blocking

The cache has a finite size: even with contiguous access, data is evicted from the cache once the access area exceeds the cache size. If data is evicted frequently, the situation is the same as always fetching from main memory, and the speed advantage of the cache is lost.

Hence, for high performance we need to:
1. fill the cache lines with useful data;
2. reuse the data on those cache lines many times.

Page 50:

An Example of Cache Misses and Blocking

Matrix-matrix multiplication under a toy cache model:
  Matrix size: 8×8 (double A[8][8]).
  Number of cache lines: 4.
  Each cache line holds 4 array elements: 4 × 8 Bytes (double) = 32 Bytes per line.
  Row-wise access is contiguous (C language).
  Cache replacement algorithm: Least Recently Used (LRU).
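The miss counts on the following pages can be explored with a small simulator of this toy cache. This is a sketch, not from the slides, assuming a fully associative 4-line LRU cache, 4 doubles per line, and the plain ijk access order (A[i][k], B[k][j], then C[i][j] per innermost iteration); exact counts depend on the access order assumed, so treat the output as illustrative:

#include <stdio.h>

#define LINES 4        /* toy cache: 4 lines */
#define ELEMS 4        /* 4 doubles per line */
#define N 8            /* 8x8 matrices */

static long cache[LINES];  /* cached line ids, most recently used first */
static int  used;
static long misses;

/* Touch one line id under LRU replacement; count misses. */
static void touch(long id) {
    int pos = -1;
    for (int p = 0; p < used; p++)
        if (cache[p] == id) { pos = p; break; }
    if (pos < 0) {                     /* miss: take the LRU (last) slot */
        misses++;
        if (used < LINES) used++;
        pos = used - 1;
    }
    for (int p = pos; p > 0; p--)      /* move to front (most recent) */
        cache[p] = cache[p - 1];
    cache[0] = id;
}

/* Line id of element (i,j) of array number a (0=C, 1=A, 2=B). */
static long line_id(int a, int i, int j) {
    return a * (N * N / ELEMS) + (i * N + j) / ELEMS;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                touch(line_id(1, i, k));   /* load A[i][k] */
                touch(line_id(2, k, j));   /* load B[k][j] */
                touch(line_id(0, i, j));   /* update C[i][j] */
            }
    printf("cache misses, non-blocked ijk order: %ld\n", misses);
    return 0;
}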

Page 51:

Relationship between the Array Layout and Cache Lines

Under these assumptions, the relationship between the arrays and the cache lines is as follows; cache line conflicts are not considered.

In C, for the arrays A[i][j], B[i][j], C[i][j]: storage runs along j, so each cache line holds a 1×4 piece of a row (elements 1-4, 5-8, ..., 61-64 of the 8×8 array).

Which line a piece occupies is determined by the <access pattern of the arrays> and the <cache replacement algorithm>.

Page 52:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: computing with C, A, and B under the toy cache (4 lines; LRU: the data on the line least recently accessed is evicted). The first accesses cause cache misses (1) through (5).

Page 53:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: continuing the computation, cache misses (6) through (11) occur (4 lines, LRU).

Page 54:

Matrix-Matrix Multiplication (Non-Blocking)

Figure: the misses keep accumulating; the number of cache misses is 22 while computing 2 elements of C (4 lines, LRU).

Page 55:

Matrix-Matrix Multiplication (Blocking: 2 Elements)

Figure: computing with a 2-element blocking unit of C (blocks (1) and (2)); the first pass incurs 6 cache misses (4 lines, LRU).

Page 56:

Matrix-Matrix Multiplication (Blocking: 2 Elements)

Figure: continuing with blocks (3) and (4). In total, the number of cache misses is 10 while computing 2 elements of C (4 lines, LRU), versus 22 without blocking.

Page 57:

Matrix-Matrix Multiplication Code (C Language): Without Blocking

Code example:

for (i=0; i<n; i++)
  for (j=0; j<n; j++)
    for (k=0; k<n; k++)
      C[i][j] += A[i][k] * B[k][j];

Figure: C is indexed by (i, j), A by (i, k), and B by (k, j).

Page 58:

Matrix-Matrix Multiplication Code (C Language): With Blocking

When n is divisible by the blocking size (ibl = 16), blocking gives the following six-nested loop (a sketch for general n follows the code):

ibl = 16;
for (ib=0; ib<n; ib+=ibl) {
  for (jb=0; jb<n; jb+=ibl) {
    for (kb=0; kb<n; kb+=ibl) {
      for (i=ib; i<ib+ibl; i++) {
        for (j=jb; j<jb+ibl; j++) {
          for (k=kb; k<kb+ibl; k++) {
            C[i][j] += A[i][k] * B[k][j];
} } } } } }
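When n is not divisible by ibl, one common fix (a sketch, not from the slide) is to clip every block boundary with a min(), so no divisibility assumption is needed:

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Blocked matrix-matrix multiply for arbitrary n; matrices are stored
   row-major in flat arrays of length n*n. Each block is clipped at the
   matrix edge by MIN(). */
void matmul_blocked(int n, double *C, const double *A, const double *B) {
    const int ibl = 16;   /* blocking size, as on the slide */
    for (int ib = 0; ib < n; ib += ibl)
        for (int jb = 0; jb < n; jb += ibl)
            for (int kb = 0; kb < n; kb += ibl)
                for (int i = ib; i < MIN(ib + ibl, n); i++)
                    for (int j = jb; j < MIN(jb + ibl, n); j++)
                        for (int k = kb; k < MIN(kb + ibl, n); k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}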

Page 59:

Matrix-Matrix Multiplication Code (Fortran Language): With Blocking

When n is divisible by the blocking size (ibl = 16), blocking gives the following six-nested loop:

ibl = 16
do ib=1, n, ibl
  do jb=1, n, ibl
    do kb=1, n, ibl
      do i=ib, ib+ibl-1
        do j=jb, jb+ibl-1
          do k=kb, kb+ibl-1
            C(i, j) = C(i, j) + A(i, k) * B(k, j)
enddo; enddo; enddo; enddo; enddo; enddo

Page 60:

Data Access Pattern with Cache Blocking

Figure: C = A × B computed block by block; matrix-matrix multiplication is performed on small ibl × ibl blocks.

Page 61:

Data Access Pattern with Cache Blocking

Figure: the ibl × ibl block of C moves on to the next position; each step again multiplies small ibl × ibl blocks of A and B.

Page 62:

Unrolling for Blocked Matrix-Matrix Multiplication (C Language)

Unrolling can also be applied to the six-nested loop of the blocked matrix-matrix multiplication. Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

ibl = 16;
for (ib=0; ib<n; ib+=ibl) {
  for (jb=0; jb<n; jb+=ibl) {
    for (kb=0; kb<n; kb+=ibl) {
      for (i=ib; i<ib+ibl; i+=2) {
        for (j=jb; j<jb+ibl; j+=2) {
          for (k=kb; k<kb+ibl; k++) {
            C[i  ][j  ] += A[i  ][k] * B[k][j  ];
            C[i+1][j  ] += A[i+1][k] * B[k][j  ];
            C[i  ][j+1] += A[i  ][k] * B[k][j+1];
            C[i+1][j+1] += A[i+1][k] * B[k][j+1];
} } } } } }

Page 63:

Unrolling for Blocked Matrix-Matrix Multiplication (Fortran Language)

Unrolling can also be applied to the six-nested loop of the blocked matrix-matrix multiplication. Depth-2 unrolling of the i-loop and j-loop looks as follows, where the blocking size ibl is divisible by 2:

ibl = 16
do ib=1, n, ibl
  do jb=1, n, ibl
    do kb=1, n, ibl
      do i=ib, ib+ibl-1, 2
        do j=jb, jb+ibl-1, 2
          do k=kb, kb+ibl-1
            C(i  , j  ) = C(i  , j  ) + A(i  , k) * B(k, j  )
            C(i+1, j  ) = C(i+1, j  ) + A(i+1, k) * B(k, j  )
            C(i  , j+1) = C(i  , j+1) + A(i  , k) * B(k, j+1)
            C(i+1, j+1) = C(i+1, j+1) + A(i+1, k) * B(k, j+1)
enddo; enddo; enddo; enddo; enddo; enddo

Page 64:

Other Techniques to Establish Speedups

Page 65:

Delete Common Statements (1)

There is redundant computation in the following program:

d = a + b + c;
f = d + a + b;

Some compilers optimize this away, but the following is better:

temp = a + b;
d = temp + c;
f = d + temp;

Page 66:

Delete Common Statements (2)

It is also better to avoid redundant array accesses:

for (i=0; i<n; i++) {
  xold[i] = x[i];
  x[i] = x[i] + y[i];
}

The following is one way to do it:

for (i=0; i<n; i++) {
  dtemp = x[i];
  xold[i] = dtemp;
  x[i] = dtemp + y[i];
}

Page 67:

Movement of Code

Division takes a long time; do not write it inside a loop if possible:

for (i=0; i<n; i++) {
  a[i] = a[i] / sqrt(dnorm);
}

In a case like this, replace the division with a multiplication:

dtemp = 1.0 / sqrt(dnorm);
for (i=0; i<n; i++) {
  a[i] = a[i] * dtemp;
}

Page 68:

IF Statements Inside Loops

Avoid if statements inside loops as much as possible:

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    if ( i != j ) A[i][j] = B[i][j];
    else A[i][j] = 1.0;
  }
}

In this case, the following is better:

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    A[i][j] = B[i][j];
  }
}
for (i=0; i<n; i++) A[i][i] = 1.0;

Page 69:

Strengthening Software Pipelining

Original code (depth-2 unrolling):

for (i=0; i<n; i+=2) {
  dtmpb0 = b[i];
  dtmpc0 = c[i];
  dtmpa0 = dtmpb0 + dtmpc0;
  a[i] = dtmpa0;
  dtmpb1 = b[i+1];
  dtmpc1 = c[i+1];
  dtmpa1 = dtmpb1 + dtmpc1;
  a[i+1] = dtmpa1;
}

Here the distance between definition and use is short, so software pipelining has nothing to work with.

Code strengthened for software pipelining (depth-2 unrolling):

for (i=0; i<n; i+=2) {
  dtmpb0 = b[i];
  dtmpb1 = b[i+1];
  dtmpc0 = c[i];
  dtmpc1 = c[i+1];
  dtmpa0 = dtmpb0 + dtmpc0;
  dtmpa1 = dtmpb1 + dtmpc1;
  a[i] = dtmpa0;
  a[i+1] = dtmpa1;
}

Here the distance between definition and use is large, creating many opportunities for software pipelining.