
Improving Parallel Performance

Intel Software College

Introduction to Parallel Programming – Part 7


Copyright © 2006, Intel Corporation. All rights reserved.

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.


Objectives

At the end of this module, you should be able to

Give two reasons why one sequential algorithm may be more suitable than another for parallelization

Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution

Explain the pros and cons of static versus dynamic loop scheduling

Explain why it can be difficult to optimize load balancing and maximize locality at the same time


General Rules of Thumb

Start with best sequential algorithm

Maximize locality


Start with Best Sequential Algorithm

Don’t confuse “speedup” with “speed”

Speedup: ratio of program’s execution time on 1 processor to its execution time on p processors

What if we start with an inferior sequential algorithm?

Naïve, higher-complexity algorithms

Easier to make parallel

Usually don’t lead to the fastest parallel program


Example: Search for Chess Move

Naïve minimax algorithm

Exhaustive search of game tree

Branching factor around 35

Nodes evaluated in a search of depth d: 35^d

Alpha-beta search algorithm

Prunes useless subtrees

Branching factor around 6

Nodes evaluated in a search of depth d: 6^d


Minimax Search

[Figure: minimax game tree. Levels alternate between “My move—choose max” and “His move—choose min”. The leaf scores 7, 3, 1, 5, 4, 6, 1, 2, 8, 0, 3, -5, -2, 4, 7, 6 propagate upward level by level, giving the root the value 3.]


Alpha-Beta Pruning

[Figure: the same game tree searched with alpha-beta pruning. Subtrees that cannot change the result are cut off, so only the leaves 7, 3, 1, 4, 6, 8, 0, 3, -5 are evaluated; the root value is still 3.]


How Deep the Search?

Depth of search achievable for a given node budget:

Nodes evaluated   Minimax,        Parallel minimax,   Alpha-beta pruning,
                  one processor   speedup 35          one processor
100,000           3               4                   6
100 million       5               6                   10
100 billion       7               8                   14

(For example, 35^3 ≈ 43,000 nodes fits within the 100,000-node budget, while 35^4 ≈ 1.5 million does not. A speedup of 35 buys only one extra level of minimax search; the better sequential algorithm buys twice the depth.)


Maximize Locality

Temporal locality: If a processor accesses a memory location, there is a good chance it will revisit that memory location soon

Spatial (data) locality: If a processor accesses a memory location, there is a good chance it will visit a nearby location soon

Programs tend to exhibit locality because they tend to have loops indexing through arrays

Principle of locality makes cache memory worthwhile
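
As a concrete illustration (added here, not from the slides): both loops below sum the same matrix, but the first walks memory in row-major order, exploiting spatial locality, while the second strides across rows and tends to miss in cache.

    #include <stdio.h>

    #define N 2048

    static double m[N][N];   /* C stores this row-major: m[i][0..N-1] are contiguous */

    int main(void)
    {
        double sum = 0.0;

        /* Good spatial locality: inner loop touches consecutive addresses */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];

        /* Poor spatial locality: inner loop strides N * sizeof(double) bytes */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }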


Parallel Processing and Locality

Multiple processors → multiple caches

When a processor writes a value, the system must ensure no processor tries to reference an obsolete value (cache coherence problem)

A write by one processor can cause the invalidation of another processor’s cache line, leading to a cache miss

Rule of thumb: Better to have different processors manipulating totally different chunks of arrays

We say a parallel program has good locality if processors’ memory writes tend not to interfere with the work being done by other processors
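
A sketch of the interference being described (my example, not from the deck): two threads bump logically independent counters that sit on the same cache line, so each write invalidates the other thread’s copy; padding each counter onto its own line removes the contention.

    #include <omp.h>

    #define CACHE_LINE 64   /* assumed cache-line size in bytes */

    long counters[2];       /* adjacent: almost certainly share one cache line */

    struct padded_counter {
        long value;
        char pad[CACHE_LINE - sizeof(long)];   /* push neighbors onto other lines */
    } padded[2];

    void count_events(long n)
    {
        #pragma omp parallel num_threads(2)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < n; i++) {
                counters[id]++;        /* false sharing: writes ping-pong the line */
                padded[id].value++;    /* no interference between threads          */
            }
        }
    }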


Example: Array Initialization

for (i = 0; i < N; i++) a[i] = 0;

Each digit below shows which of four processors writes that array element.

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

Terrible allocation of work to processors: every cache line is written by all four processors

0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3

Better allocation of work to processors...

unless the sub-arrays map to the same cache lines!
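
In OpenMP terms (a sketch assuming the slide’s a and N), the two pictures correspond to schedule(static, 1) and plain schedule(static):

    int i;

    /* Interleaved (first picture): thread t gets elements t, t+4, t+8, ... */
    #pragma omp parallel for num_threads(4) schedule(static, 1)
    for (i = 0; i < N; i++) a[i] = 0;

    /* Blocked (second picture): thread t gets one contiguous chunk of ~N/4 */
    #pragma omp parallel for num_threads(4) schedule(static)
    for (i = 0; i < N; i++) a[i] = 0;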


Loop Transformations

Loop fission

Loop fusion

Loop inversion


Loop Fission

Begin with single loop having loop-carried dependence

Split loop into two or more loops

New loops can be executed in parallel


Before Loop Fission

float *a, *b;
int i;
for (i = 1; i < N; i++) {
    if (b[i] > 0.0) a[i] = 2.0 * b[i];        /* perfectly parallel      */
    else            a[i] = 2.0 * fabs(b[i]);
    b[i] = a[i-1];                            /* loop-carried dependence */
}


After Loop Fission

#pragma omp parallel
{
    #pragma omp for
    for (i = 1; i < N; i++) {
        if (b[i] > 0.0) a[i] = 2.0 * b[i];
        else            a[i] = 2.0 * fabs(b[i]);
    }
    #pragma omp for
    for (i = 1; i < N; i++) {
        b[i] = a[i-1];
    }
}

This works because there is a barrier synchronization after a parallel


Loop Fission and Locality

Another use of loop fission is to increase data locality

Before fission, nested loops reference too many data values, leading to poor cache hit rate

Break nested loops into multiple nested loops

New nested loops have higher cache hit rate


Before Fission

for (i = 0; i < list_len; i++)
    for (j = prime[i]; j < N; j += prime[i])
        marked[j] = 1;

[Figure: the marked array; each pass over a prime sweeps the entire array.]


After Fission

for (k = 0; k < N; k += CHUNK_SIZE)
    for (i = 0; i < list_len; i++) {
        start = f(prime[i], k);
        end = g(prime[i], k);
        for (j = start; j < end; j += prime[i])
            marked[j] = 1;
    }

[Figure: the marked array processed one CHUNK_SIZE segment at a time; all primes sweep a segment before moving on to the next.]
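
The slides leave f and g undefined. A plausible reading (my assumption, not from the deck): f rounds up to the first multiple of prime[i] at or after the start of chunk k, and g clips to the chunk’s end:

    /* Hypothetical helpers; assumes CHUNK_SIZE and N are defined as above */
    long f(long p, long k)            /* first multiple of p at or after k */
    {
        return ((k + p - 1) / p) * p;
    }

    long g(long p, long k)            /* one past the last index in this chunk */
    {
        return (k + CHUNK_SIZE < N) ? k + CHUNK_SIZE : N;
    }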


Loop Fusion

The opposite of loop fission

Combine loops → increase grain size


Before Loop Fusion

float *a, *b, x, y;
int i;
...
for (i = 0; i < N; i++) a[i] = foo(i);
x = a[N-1] - a[0];
for (i = 0; i < N; i++) b[i] = bar(a[i]);
y = x * b[0] / b[N-1];

Functions foo and bar are side-effect free.


After Loop Fusion

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = foo(i);
    b[i] = bar(a[i]);
}
x = a[N-1] - a[0];
y = x * b[0] / b[N-1];

Now one barrier instead of two


Loop Inversion

Nested for loops may have data dependences that prevent parallelization

Inverting the nesting of for loops may

Expose a parallelizable loop

Increase grain size

Improve parallel program’s locality


Before Loop Inversion

for (j = 1; j < n; j++)
    #pragma omp parallel for
    for (i = 0; i < m; i++)
        a[i][j] = 2 * a[i][j-1];

Can execute the inner loop in parallel, but the grain size is small (one parallel region per iteration of j)


After Loop Inversion

#pragma omp parallel for
for (i = 0; i < m; i++)
    for (j = 1; j < n; j++)
        a[i][j] = 2 * a[i][j-1];

Can execute the outer loop in parallel (the dependence runs along j, inside each thread’s rows)


Reducing Parallel Overhead

Loop scheduling

Conditionally executing in parallel

Replicating work


Loop Scheduling

Loop schedule: how loop iterations are assigned to threads

Static schedule: iterations assigned to threads before execution of loop

Dynamic schedule: iterations assigned to threads during execution of loop


Loop Scheduling in OpenMP

Name          Chunk      Chunk Size    # Chunks    Static/Dynamic
Static        No         N/P           P           Static
Interleaved   Yes        C             N/C         Static
Dynamic       Optional   C             N/C         Dynamic
Guided        Optional   Decreasing    < N/C       Dynamic
Runtime       No         Varies        Varies      Varies

(N = number of iterations, P = number of threads, C = user-specified chunk size)

From Parallel Programming in OpenMP by Chandra et al.
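
The corresponding OpenMP clauses (standard syntax; the loop body work(i) and the chunk size 4 are placeholders of mine, not from the slides):

    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        work(i);

    /* schedule(static)       blocked: one chunk of about n/P per thread       */
    /* schedule(static, 4)    interleaved: chunks of 4 dealt round-robin       */
    /* schedule(dynamic, 4)   chunks of 4 handed out to threads on demand      */
    /* schedule(guided)       decreasing chunk sizes, handed out on demand     */
    /* schedule(runtime)      chosen via the OMP_SCHEDULE environment variable */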


Loop Scheduling Example

#pragma omp parallel for
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;
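
Iteration i of the outer loop does i + 1 units of work, so a plain blocked schedule leaves the thread holding the last rows with far more work than the first. An interleaved schedule (my suggested variant, not on the slide) balances the load at some cost in locality:

    #pragma omp parallel for schedule(static, 1)   /* deal rows out round-robin */
    for (i = 0; i < 12; i++)
        for (j = 0; j <= i; j++)
            a[i][j] = i + j;   /* placeholder body standing in for the slide's "..." */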


[Figure: four panels labeled A, B, C, and D showing how the 12 iterations of the triangular loop are distributed among threads under different schedules.]


Locality v. Load Balance

[Figure: diagram relating locality and load balance for smaller versus larger data sets; improving one tends to worsen the other.]


Conditionally Enable Parallelism

Suppose a sequential loop has execution time jn (n iterations, each taking time j)

Suppose barrier synchronization time is kp (p threads, cost k per thread)

We should make the loop parallel only if the parallel time beats the sequential time:

    jn/p + kp < jn,  i.e.,  n > kp / (j(1 - 1/p))

OpenMP’s if clause lets us conditionally enable parallelism
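
Spelled out (my restatement of the slide’s condition):

\[
\frac{jn}{p} + kp \;<\; jn
\;\Longleftrightarrow\;
kp \;<\; jn\left(1 - \frac{1}{p}\right)
\;\Longleftrightarrow\;
n \;>\; \frac{kp}{j\,(1 - 1/p)}
\]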


Example of if Clause

Suppose benchmarking shows a loop executes faster in parallel only when n > 1250

#pragma omp parallel for if (n > 1250)
for (i = 0; i < n; i++) {
    ...
}


Replicate Work

Every thread interaction has a cost

Example: Barrier synchronization

Sometimes it’s faster for threads to replicate work than to go through a barrier synchronization


Before Work Replication

for (i = 0; i < N; i++) a[i] = foo(i);
x = a[0] / a[N-1];
for (i = 0; i < N; i++) b[i] = x * a[i];

Both for loops are amenable to parallelization

Synchronization among threads is required if x is shared and one thread performs the assignment


After Work Replication

#pragma omp parallel private (x)
{
    x = foo(0) / foo(N-1);        /* every thread computes x itself...       */
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = foo(i);
        b[i] = x * a[i];          /* ...so no barrier is needed between
                                     computing x and using it                */
    }
}


References

Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).

Peter J. Denning, “The Locality Principle,” Communications of the ACM 48(7) (2005).

Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).
