Slides from the thesis defence in Chicago by Vittorio Giovara.
Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents
• Motivation
• Strategy
• Technologies
• OpenMP
• MPI
• Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation
• The CMOS scaling trend cannot continue:
✓ direct-tunneling limit in SiO2 ~3 nm
✓ distance between Si atoms ~0.3 nm
✓ variability
• Fundamental reason: rising fabrication cost
Motivation
• Building processors with multiple cores is easy
• Adapting software to run concurrently still requires human effort
• New classification for computer architectures
Classification
[Figure: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD, each drawn as one or more CPUs connected to an instruction pool and a data pool]
Levels
[Figure: parallelization levels by abstraction — algorithm, loop level, process management — ordered from easier to harder to parallelize; associated concerns include data dependency, branching overhead, control flow, recursion, memory management, and profiling; related mechanisms range from SMP and multithreading/scheduling to multiprogramming]
Backfire
• Difficult to fully exploit the parallelism offered
• Automatic tools required to adapt software to parallelism
• Compiler support for manual or semi-automatic enhancement
Applications
• OpenMP and MPI are two popular tools that simplify parallelizing both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetics program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully address the problem (no Infiniband or combined OpenMP+MPI solutions)
Strategy
• Install a Linux kernel configured ad hoc for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link between cluster nodes, with the proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify the Infiniband network with MPI test programs
• Install the target software
• Add OpenMP and MPI directives to the code
• Run test cases
OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple constructs
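To give a flavour of how little is needed, here is a minimal OpenMP sketch in Fortran; the loop body and variable names are invented for illustration and are not taken from the target software. It compiles with gfortran -fopenmp.

```fortran
program omp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  b = 1.0

  ! A single directive spreads the loop iterations across all available threads.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end parallel do

  print *, 'max threads:', omp_get_max_threads(), ' a(1) =', a(1)
end program omp_demo
```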
OpenMP - example
[Figure: fork-join model — a master thread forks parallel tasks 1-4 onto threads A and B, which then join back into the master thread]
OpenMP Scheduler
• Which scheduler is best suited to the hardware?
- Static
- Dynamic
- Guided
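As a sketch of how the policy is chosen, the schedule clause selects it on the loop directive; the loop and the chunk size of 100 below are only illustrative, and the clause can be swapped to compare the three policies.

```fortran
program schedule_demo
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real, allocatable :: a(:)

  allocate(a(n))
  ! schedule(static, 100):  fixed-size chunks dealt out round-robin in advance
  ! schedule(dynamic, 100): chunks handed to threads on demand as they go idle
  ! schedule(guided, 100):  chunk size shrinks progressively down to the minimum
  !$omp parallel do schedule(dynamic, 100)
  do i = 1, n
     a(i) = sqrt(real(i))
  end do
  !$omp end parallel do

  print *, 'a(n) =', a(n)
end program schedule_demo
```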
OpenMP Scheduler
[Chart: OpenMP static scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: OpenMP dynamic scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: OpenMP guided scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: side-by-side comparison of the static, dynamic, and guided schedulers]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
- OpenMPI
- MVAPICH
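A minimal MPI sketch in Fortran (a hypothetical example, not code from the target software) showing the rank/size queries and the send/receive pattern the port relies on later; it works with either OpenMPI or MVAPICH.

```fortran
program mpi_demo
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  real :: payload(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (nprocs < 2) then
     if (rank == 0) print *, 'run with at least 2 processes'
  else if (rank == 0) then
     payload = 3.14
     ! Rank 0 sends a small array to rank 1.
     call MPI_Send(payload, 4, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(payload, 4, MPI_REAL, 0, 0, MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE, ierr)
     print *, 'rank 1 received', payload(1)
  end if

  call MPI_Finalize(ierr)
end program mpi_demo
```

Launched with, for example, mpirun -np 2 ./mpi_demo.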
Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: message latency (µs, log scale) vs. message size from 1 kB to 16 GB, comparing OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: message latency (µs, log scale) vs. message size from 1 kB to 8 MB, comparing OpenMPI and Mvapich2]
Optimizations
• Active at compile time
• Available only after porting the software to standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
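Purely as an illustration of how the flags combine on the command line (the compiler wrapper and file names are hypothetical; mpif90 is the usual Fortran wrapper shipped by both OpenMPI and MVAPICH):

```sh
# Hypothetical build command: the flags are the ones listed above.
mpif90 -fopenmp -march=native -O3 -ffast-math -Wl,-O1 -o sally3d main.f90
```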
Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models
Implementation Scheme
[Figure: the sequential loop of the standard programming model becomes a parallel loop across OpenMP threads, and then a distributed loop in which MPI splits the work between Host 1 and Host 2, each running its own OpenMP threads]
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization objects required (see the sketch below)
➡ send() and recv() mechanism
➡ critical regions using OpenMP directives
➡ function merging
➡ matrix conversion
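A rough sketch, in the spirit of the scheme above, of how the pieces could fit together; the array names, sizes, and slab partitioning are hypothetical illustrations, not Sally3D's actual code. MPI splits the outer dimension of the three-dimensional matrix across hosts, OpenMP parallelizes the loop within each host, an OpenMP critical region protects a shared temporary, and the send()/recv() exchange is condensed into a single reduction.

```fortran
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nx = 64, ny = 64, nz = 64
  integer :: ierr, rank, nprocs, i, j, k, z0, z1
  real :: field(nx, ny, nz), local_max

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  field = 1.0
  ! Distributed loop: each MPI rank works on its own slab of the z dimension.
  z0 = rank * nz / nprocs + 1
  z1 = (rank + 1) * nz / nprocs

  local_max = 0.0
  ! Parallel loop: OpenMP threads share the slab on this host.
  !$omp parallel do private(i, j)
  do k = z0, z1
     do j = 1, ny
        do i = 1, nx
           field(i, j, k) = 2.0 * field(i, j, k)
           ! Critical region guards the shared temporary (illustrating the directive,
           ! not an efficient reduction).
           !$omp critical
           if (field(i, j, k) > local_max) local_max = field(i, j, k)
           !$omp end critical
        end do
     end do
  end do
  !$omp end parallel do

  ! The cross-host exchange is condensed into a single reduction here.
  call MPI_Allreduce(MPI_IN_PLACE, local_max, 1, MPI_REAL, MPI_MAX, &
                     MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global max =', local_max

  call MPI_Finalize(ierr)
end program hybrid_sketch
```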
Results
OMP   MPI   OPT   seconds
 *     *     *       133
 *     *     -       400
 *     -     *       186
 *     -     -       487
 -     *     *       200
 -     *     -       792
 -     -     *       246
 -     -     -      1062

Total speed increase: 87.52% (runtime reduced from 1062 s to 133 s)
Actual Results

OMP   MPI   seconds
 *     *        59
 *     -       129
 -     *       174
 -     -       249
Function Name     Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua     24.5 s    4.7 s   14.4 s      2.8 s
calc_hdmg_tet     16.9 s    3.0 s   10.8 s      1.7 s
calc_mudua        12.1 s    1.9 s    7.0 s      1.1 s
campo_effettivo   17.7 s    4.5 s    9.9 s      2.3 s
Actual Results
Total raw speed increment: 76% (runtime reduced from 249 s to 59 s)
• OpenMP: 6-8x
• MPI: 2x
• OpenMP + MPI: 14-16x
Conclusions
Conclusions and Future Work
• Computational time has been significantly decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG ‘09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size