Improving Parallel Performance
Intel Software College
Introduction to Parallel Programming – Part 7
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Objectives
At the end of this module, you should be able to
Give two reasons why one sequential algorithm may be more suitable than another for parallelization
Use loop fusion, loop fission, and loop inversion to create or improve opportunities for parallel execution
Explain the pros and cons of static versus dynamic loop scheduling
Explain why it can be difficult to optimize load balancing and maximize locality at the same time
General Rules of Thumb
Start with best sequential algorithm
Maximize locality
Start with Best Sequential Algorithm
Don't confuse "speedup" with "speed"
Speedup: the ratio of a program's execution time on one processor to its execution time on p processors, S(p) = T(1) / T(p)
What if we start with an inferior sequential algorithm?
Naïve, higher-complexity algorithms
Easier to make parallel
Usually don't lead to the fastest parallel algorithm
Example: Search for Chess Move
Naïve minimax algorithm
Exhaustive search of game tree
Branching factor around 35
Nodes evaluated in search of depth d: 35^d
Alpha-beta search algorithm
Prunes useless subtrees
Branching factor around 6
Nodes evaluated in search of depth d: 6^d
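The slides give no code for alpha-beta; as a rough sketch (the Node type and tree representation here are invented for illustration, not taken from the slides):

#include <limits.h>

/* Hypothetical game-tree node: a leaf carries a static evaluation,
   an interior node carries its children. */
typedef struct Node {
    int value;                 /* static evaluation (used at leaves) */
    int nchildren;             /* 0 for a leaf */
    struct Node **child;
} Node;

/* Alpha-beta search: returns the minimax value of n, skipping any
   subtree that provably cannot change the result. */
int alpha_beta(const Node *n, int alpha, int beta, int maximizing) {
    if (n->nchildren == 0)
        return n->value;
    if (maximizing) {
        int best = INT_MIN;
        for (int i = 0; i < n->nchildren; i++) {
            int v = alpha_beta(n->child[i], alpha, beta, 0);
            if (v > best) best = v;
            if (best > alpha) alpha = best;
            if (alpha >= beta) break;   /* remaining siblings are useless: prune */
        }
        return best;
    } else {
        int best = INT_MAX;
        for (int i = 0; i < n->nchildren; i++) {
            int v = alpha_beta(n->child[i], alpha, beta, 1);
            if (v < best) best = v;
            if (best < beta) beta = best;
            if (alpha >= beta) break;   /* prune */
        }
        return best;
    }
}

/* Initial call: alpha_beta(root, INT_MIN, INT_MAX, 1) */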
Minimax Search
[Figure: minimax search of a four-level game tree. Levels alternate "My move—choose max" and "His move—choose min". Leaf values: 7 3 1 5 4 6 1 2 8 0 3 -5 -2 4 7 6. Taking the min of each leaf pair gives 3 1 4 1 0 -5 -2 6; the max of those pairs gives 3 4 0 6; the min of those gives 3 0; and the max at the root gives 3.]
Alpha-Beta Pruning
[Figure: alpha-beta search of the same game tree. Only the leaves 7, 3, 1, 4, 6, 8, 0, 3, -5 are evaluated; the remaining subtrees are pruned because they cannot change the outcome. The surviving levels evaluate to 3 1 4 0 -5, then 3 4 0, then 3 0, and the root value is still 3. Levels alternate "My move—choose max" and "His move—choose min".]
How Deep the Search?
Maximum search depth for a given number of nodes evaluated:

Nodes Evaluated    Minimax on one processor    Parallel minimax, speedup 35    Alpha-beta pruning on one processor
100,000            3                           4                               6
100 million        5                           6                               10
100 billion        7                           8                               14
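To check one entry: sequential minimax at depth 3 evaluates about 35^3 ≈ 43,000 nodes, within the 100,000-node budget, while depth 4 would need 35^4 ≈ 1.5 million. A perfect speedup of 35 multiplies the budget by 35 and buys exactly one extra ply. Alpha-beta reaches depth 6 on a single processor, since 6^6 ≈ 47,000 — a better algorithm beats more processors.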
Maximize Locality
Temporal locality: If a processor accesses a memory location, there is a good chance it will revisit that memory location soon
Spatial locality: If a processor accesses a memory location, there is a good chance it will visit a nearby location soon
Programs tend to exhibit locality because they tend to have loops indexing through arrays
Principle of locality makes cache memory worthwhile
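As a concrete illustration (a minimal sketch, not from the original slides): summing a two-dimensional C array row by row walks through memory sequentially and exploits spatial locality, while summing it column by column touches addresses a full row apart and defeats it.

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static double a[ROWS][COLS];

/* Row-major traversal: consecutive accesses touch adjacent addresses,
   so every byte of each fetched cache line is used. */
double sum_row_major(void) {
    double sum = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: consecutive accesses are
   COLS * sizeof(double) bytes apart, so most of each cache line is
   fetched and then evicted unused. */
double sum_col_major(void) {
    double sum = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}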
Parallel Processing and Locality
Multiple processors ⇒ multiple caches
When a processor writes a value, the system must ensure no processor references an obsolete value (the cache coherence problem)
A write by one processor can invalidate another processor's cache line, causing a cache miss there
Rule of thumb: it is better to have different processors manipulate entirely different chunks of an array
We say a parallel program has good locality if one processor's memory writes tend not to interfere with the work being done by other processors
Example: Array Initialization
for (i = 0; i < N; i++) a[i] = 0;
Processor assigned to each element (interleaved):
0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
Terrible allocation of work to processors
Processor assigned to each element (blocked):
0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
Better allocation of work to processors...
unless the sub-arrays map to the same cache lines!
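In OpenMP terms, the two pictures above correspond to interleaved and blocked static schedules (a sketch assuming four threads; the slides do not show this code):

/* Interleaved: with schedule(static, 1), threads 0..3 take iterations
   round-robin, so neighboring elements -- typically sharing a cache
   line -- are written by different threads (false sharing). */
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < N; i++) a[i] = 0;

/* Blocked: with the default schedule(static), each thread gets one
   contiguous chunk of about N/P elements, so threads write to mostly
   disjoint cache lines. */
#pragma omp parallel for schedule(static)
for (i = 0; i < N; i++) a[i] = 0;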
Loop Transformations
Loop fission
Loop fusion
Loop inversion
Loop Fission
Begin with a single loop that has a loop-carried dependence
Split it into two or more loops
The new loops can be executed in parallel
Before Loop Fission
#include <math.h>   /* for fabs() */

float *a, *b;
int i;
for (i = 1; i < N; i++) {
    if (b[i] > 0.0) a[i] = 2.0 * b[i];
    else            a[i] = 2.0 * fabs(b[i]);
    b[i] = a[i-1];
}
The conditional assignment to a[i] is perfectly parallel; b[i] = a[i-1] is a loop-carried dependence.
After Loop Fission
#pragma omp parallel
{
    #pragma omp for
    for (i = 1; i < N; i++) {
        if (b[i] > 0.0) a[i] = 2.0 * b[i];
        else            a[i] = 2.0 * fabs(b[i]);
    }
    #pragma omp for
    for (i = 1; i < N; i++) {
        b[i] = a[i-1];
    }
}
This works because there is an implicit barrier synchronization at the end of a parallel for loop, so all of a[] is written before the second loop reads it.
Loop Fission and Locality
Another use of loop fission is to increase data locality
Before fission, the nested loops reference too many data values between reuses, leading to a poor cache hit rate
Break the nested loops into multiple nested loops
The new loops have a higher cache hit rate
Before Fission
/* For each prime, sweep the entire marked[] array -- every pass
   streams the whole array through the cache before any element
   is revisited. */
for (i = 0; i < list_len; i++)
    for (j = prime[i]; j < N; j += prime[i])
        marked[j] = 1;

[Figure: the marked[] array, swept end to end once per prime]
After Fission

/* Process marked[] one cache-sized chunk at a time, applying every
   prime to the chunk before moving on. f and g (left abstract on the
   slide) compute the bounds of prime[i]'s multiples within the chunk
   starting at k. */
for (k = 0; k < N; k += CHUNK_SIZE)
    for (i = 0; i < list_len; i++) {
        start = f(prime[i], k);
        end = g(prime[i], k);
        for (j = start; j < end; j += prime[i])
            marked[j] = 1;
    }
[Figure: marked[] processed chunk by chunk -- each chunk stays cache-resident while all primes are applied to it, then the next chunk is processed, etc.]
Loop Fusion
The opposite of loop fission
Combine loops to increase grain size
Before Loop Fusion
float *a, *b, x, y;
int i;
...
for (i = 0; i < N; i++) a[i] = foo(i);
x = a[N-1] - a[0];
for (i = 0; i < N; i++) b[i] = bar(a[i]);
y = x * b[0] / b[N-1];
Functions foo and bar are side-effect free.
After Loop Fusion
#pragma omp parallel for
for (i = 0; i < N; i++) {
a[i] = foo(i);
b[i] = bar(a[i]);
}
x = a[N-1] - a[0];
y = x * b[0] / b[N-1];
Now there is one implicit barrier instead of two
Loop Inversion
Nested for loops may have data dependences that prevent parallelization
Inverting the nesting of for loops may
Expose a parallelizable loop
Increase grain size
Improve the parallel program's locality
Before Loop Inversion
for (j = 1; j < n; j++)
    #pragma omp parallel for
    for (i = 0; i < m; i++)
        a[i][j] = 2 * a[i][j-1];
Can execute the inner loop in parallel, but the grain size is small
After Loop Inversion
#pragma omp parallel for private(j)
for (i = 0; i < m; i++)
    for (j = 1; j < n; j++)
        a[i][j] = 2 * a[i][j-1];
Can execute the outer loop in parallel
Reducing Parallel Overhead
Loop scheduling
Conditionally executing in parallel
Replicating work
Loop Scheduling
Loop schedule: how loop iterations are assigned to threads
Static schedule: iterations assigned to threads before execution of loop
Dynamic schedule: iterations assigned to threads during execution of loop
Loop Scheduling in OpenMP
Name         Chunk      Chunk Size    # Chunks    Static/Dynamic
Static       No         N/P           P           Static
Interleaved  Yes        C             N/C         Static
Dynamic      Optional   C             N/C         Dynamic
Guided       Optional   Decreasing    < N/C       Dynamic
Runtime      No         Varies        Varies      Varies

(N = number of iterations, P = number of threads, C = user-chosen chunk size)
From Parallel Programming in OpenMP by Chandra et al.
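In source code, the table's rows correspond to OpenMP schedule clauses roughly as follows (a sketch; N and C as in the table):

enum { N = 1000000, C = 64 };
double a[N];

void scale(double s) {
    int i;
    /* Static: iteration space split once into ~N/P contiguous blocks */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) a[i] *= s;

    /* Interleaved: fixed chunks of C dealt round-robin, decided before the loop runs */
    #pragma omp parallel for schedule(static, C)
    for (i = 0; i < N; i++) a[i] *= s;

    /* Dynamic: idle threads grab the next chunk of C at run time */
    #pragma omp parallel for schedule(dynamic, C)
    for (i = 0; i < N; i++) a[i] *= s;

    /* Guided: chunks start large and shrink toward C */
    #pragma omp parallel for schedule(guided, C)
    for (i = 0; i < N; i++) a[i] *= s;

    /* Runtime: schedule taken from the OMP_SCHEDULE environment variable */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++) a[i] *= s;
}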
Loop Scheduling Example
#pragma omp parallel for private(j)
for (i = 0; i < 12; i++)
    for (j = 0; j <= i; j++)
        a[i][j] = ...;
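Iteration i runs i+1 inner iterations, so a blocked static schedule would give the last thread far more work than the first. One hedged fix is an interleaved schedule (work() below is a hypothetical stand-in for the slide's elided loop body):

double a[12][12];
double work(int i, int j);    /* hypothetical stand-in for the elided body */

void fill(void) {
    int i, j;
    /* schedule(static, 1) deals rows round-robin, so every thread gets a
       mix of cheap (small i) and expensive (large i) rows. schedule(dynamic)
       would balance even better, at extra run-time cost and worse locality. */
    #pragma omp parallel for private(j) schedule(static, 1)
    for (i = 0; i < 12; i++)
        for (j = 0; j <= i; j++)
            a[i][j] = work(i, j);
}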
[Figure: four panels, A through D, comparing iteration-to-thread assignments for the triangular loop above under different schedules]
Locality v. Load Balance

[Figure: a spectrum between optimizing for Locality and optimizing for Load Balance, annotated with "Smaller Data Sets" at one end and "Larger Data Sets" at the other — the best scheduling choice shifts along this spectrum with problem size]
Conditionally Enable Parallelism
Suppose sequential loop has execution time jn
Suppose barrier synchronization time is kp
We should make the loop parallel only if the parallel time, jn/p + kp, beats the sequential time:

    jn > jn/p + kp,  i.e.,  n > kp / (j (1 - 1/p))

OpenMP's if clause lets us conditionally enable parallelism
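For example (illustrative numbers only, not from the slides): with p = 2 threads, j = 0.1 microseconds per iteration, and k = 25 microseconds of synchronization cost per thread, the loop is worth parallelizing only when n > (25 × 2) / (0.1 × (1 - 1/2)) = 1000 iterations.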
Example of if Clause
Suppose benchmarking shows a loop executes faster in parallel only when n > 1250
#pragma omp parallel for if (n > 1250)
for (i = 0; i < n; i++) {
...
}
Replicate Work
Every thread interaction has a cost
Example: Barrier synchronization
Sometimes it’s faster for threads to replicate work than to go through a barrier synchronization
Before Work Replication
for (i = 0; i < N; i++) a[i] = foo(i);
x = a[0] / a[N-1];
for (i = 0; i < N; i++) b[i] = x * a[i];
Both for loops are amenable to parallelization
Synchronization among threads is required if x is shared and one thread performs the assignment
After Work Replication
#pragma omp parallel private(x)
{
    /* Every thread computes x for itself -- two redundant calls to
       foo() per thread -- instead of waiting at a barrier for one
       thread to compute and publish it. */
    x = foo(0) / foo(N-1);
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = foo(i);
        b[i] = x * a[i];
    }
}
References
Rohit Chandra, Leonardo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, and Ramesh Menon, Parallel Programming in OpenMP, Morgan Kaufmann (2001).
Peter Denning, “The Locality Principle,” Naval Postgraduate School (2005).
Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).