Slides from the thesis defence in Chicago by Vittorio Giovara.
Parallel and Distributed Computing on Low Latency Clusters
Vittorio Giovara
M.S. Electrical Engineering and Computer Science
University of Illinois at Chicago
May 2009
Contents
• Motivation
• Strategy
• Technologies
• OpenMP
• MPI
• Infiniband
• Application
• Compiler Optimizations
• OpenMP and MPI over Infiniband
• Results
• Conclusions
Motivation
• The CMOS scaling trend cannot continue:
✓ direct-tunneling limit in SiO2 ~3 nm
✓ distance between Si atoms ~0.3 nm
✓ variability
• Fundamental reason: rising fabrication cost
Motivation
• Building processors with multiple cores is easy
• Adapting software to run concurrently still requires human effort
• New classification for computer architectures
Classification
[Figure: Flynn's taxonomy — SISD, SIMD, MISD, and MIMD, each drawn as one or more CPUs connected to an instruction pool and a data pool]
Levels
[Figure: parallelization levels by abstraction — algorithm, loop level, process management — ordered from easier to harder to parallelize; associated concerns include data dependency, branching overhead, control flow, recursion, memory management, and profiling; related mechanisms range from SMP and multithreading/scheduling to multiprogramming]
Backfire
• Difficult to fully exploit the parallelism offered
• Automatic tools required to adapt software to parallelism
• Compiler support for manual or semi-automatic enhancement
Applications
• OpenMP and MPI are two popular tools that simplify parallelizing both new and existing software
• Mathematics and Physics
• Computer Science
• Biomedicine
Specific Problem and Background
• Sally3D is a micromagnetics program suite for field analysis and modeling developed at Politecnico di Torino (Department of Electrical Engineering)
• Computationally intensive (runs can take days of CPU time); a speedup is required
• Previous work does not fully address the problem (no Infiniband or combined OpenMP+MPI solutions)
Strategy
• Install a Linux kernel configured ad hoc for scientific computation
• Compile an OpenMP-enabled GCC (supported from 4.3.1 onwards)
• Add the Infiniband link between cluster nodes, with the proper drivers in kernel and user space
• Select an MPI implementation library
Strategy
• Verify the Infiniband network with MPI test programs
• Install the target software
• Add OpenMP and MPI directives to the code
• Run test cases
OpenMP
• standard
• supported by most modern compilers
• requires little knowledge of the software
• very simple constructs
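To give a flavour of how little is needed, here is a minimal OpenMP sketch in Fortran; the loop body and variable names are invented for illustration and are not taken from the target software. It compiles with gfortran -fopenmp.

```fortran
program omp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: a(:), b(:)

  allocate(a(n), b(n))
  b = 1.0

  ! A single directive spreads the loop iterations across all available threads.
  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do
  !$omp end parallel do

  print *, 'max threads:', omp_get_max_threads(), ' a(1) =', a(1)
end program omp_demo
```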
OpenMP - example
[Figure: fork-join model — a master thread forks parallel tasks 1-4 onto threads A and B, which then join back into the master thread]
OpenMP Scheduler
• Which scheduler is best suited to the hardware?
- Static
- Dynamic
- Guided
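As a sketch of how the policy is chosen, the schedule clause selects it on the loop directive; the loop and the chunk size of 100 below are only illustrative, and the clause can be swapped to compare the three policies.

```fortran
program schedule_demo
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real, allocatable :: a(:)

  allocate(a(n))
  ! schedule(static, 100):  fixed-size chunks dealt out round-robin in advance
  ! schedule(dynamic, 100): chunks handed to threads on demand as they go idle
  ! schedule(guided, 100):  chunk size shrinks progressively down to the minimum
  !$omp parallel do schedule(dynamic, 100)
  do i = 1, n
     a(i) = sqrt(real(i))
  end do
  !$omp end parallel do

  print *, 'a(n) =', a(n)
end program schedule_demo
```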
OpenMP Scheduler
[Chart: OpenMP static scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: OpenMP dynamic scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: OpenMP guided scheduler — execution time (microseconds) vs. number of threads (1-16), for chunk sizes 1, 10, 100, 1000, and 10000]
OpenMP Scheduler
[Chart: side-by-side comparison of the static, dynamic, and guided schedulers]
MPI
• standard
• widely used in cluster environments
• many transport links supported
• different implementations available
- OpenMPI
- MVAPICH
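A minimal MPI sketch in Fortran (a hypothetical example, not code from the target software) showing the rank/size queries and the send/receive pattern the port relies on later; it works with either OpenMPI or MVAPICH.

```fortran
program mpi_demo
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  real :: payload(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  if (nprocs < 2) then
     if (rank == 0) print *, 'run with at least 2 processes'
  else if (rank == 0) then
     payload = 3.14
     ! Rank 0 sends a small array to rank 1.
     call MPI_Send(payload, 4, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     call MPI_Recv(payload, 4, MPI_REAL, 0, 0, MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE, ierr)
     print *, 'rank 1 received', payload(1)
  end if

  call MPI_Finalize(ierr)
end program mpi_demo
```

Launched with, for example, mpirun -np 2 ./mpi_demo.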
Infiniband
• standard
• widely used in cluster environments
• very low latency for small packets
• up to 16 Gb/s transfer speed
MPI over Infiniband
[Chart: message latency (µs, log scale) vs. message size from 1 kB to 16 GB, comparing OpenMPI and Mvapich2]
MPI over Infiniband
[Chart: message latency (µs, log scale) vs. message size from 1 kB to 8 MB, comparing OpenMPI and Mvapich2]
Optimizations
• Active at compile time
• Available only after porting the software to standard FORTRAN
• Consistent documentation available
• Unexpected positive results
Optimizations
• -march=native
• -O3
• -ffast-math
• -Wl,-O1
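Purely as an illustration of how the flags combine on the command line (the compiler wrapper and file names are hypothetical; mpif90 is the usual Fortran wrapper shipped by both OpenMPI and MVAPICH):

```sh
# Hypothetical build command: the flags are the ones listed above.
mpif90 -fopenmp -march=native -O3 -ffast-math -Wl,-O1 -o sally3d main.f90
```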
Target Software
• Sally3D
• micromagnetic equation solver
• written in FORTRAN with some C libraries
• the program uses a linear formulation of the mathematical models
Implementation Scheme
[Figure: the sequential loop of the standard programming model becomes a parallel loop across OpenMP threads, and then a distributed loop in which MPI splits the work between Host 1 and Host 2, each running its own OpenMP threads]
• Data structure: not embarrassingly parallel
• Three-dimensional matrix
• Several temporary arrays – synchronization objects required (see the sketch below)
➡ send() and recv() mechanism
➡ critical regions using OpenMP directives
➡ function merging
➡ matrix conversion
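A rough sketch, in the spirit of the scheme above, of how the pieces could fit together; the array names, sizes, and slab partitioning are hypothetical illustrations, not Sally3D's actual code. MPI splits the outer dimension of the three-dimensional matrix across hosts, OpenMP parallelizes the loop within each host, an OpenMP critical region protects a shared temporary, and the send()/recv() exchange is condensed into a single reduction.

```fortran
program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: nx = 64, ny = 64, nz = 64
  integer :: ierr, rank, nprocs, i, j, k, z0, z1
  real :: field(nx, ny, nz), local_max

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  field = 1.0
  ! Distributed loop: each MPI rank works on its own slab of the z dimension.
  z0 = rank * nz / nprocs + 1
  z1 = (rank + 1) * nz / nprocs

  local_max = 0.0
  ! Parallel loop: OpenMP threads share the slab on this host.
  !$omp parallel do private(i, j)
  do k = z0, z1
     do j = 1, ny
        do i = 1, nx
           field(i, j, k) = 2.0 * field(i, j, k)
           ! Critical region guards the shared temporary (illustrating the directive,
           ! not an efficient reduction).
           !$omp critical
           if (field(i, j, k) > local_max) local_max = field(i, j, k)
           !$omp end critical
        end do
     end do
  end do
  !$omp end parallel do

  ! The cross-host exchange is condensed into a single reduction here.
  call MPI_Allreduce(MPI_IN_PLACE, local_max, 1, MPI_REAL, MPI_MAX, &
                     MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global max =', local_max

  call MPI_Finalize(ierr)
end program hybrid_sketch
```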
Results
OMP   MPI   OPT   seconds
 *     *     *       133
 *     *     -       400
 *     -     *       186
 *     -     -       487
 -     *     *       200
 -     *     -       792
 -     -     *       246
 -     -     -      1062

Total speed increase: 87.52% (runtime reduced from 1062 s to 133 s)
Actual Results

OMP   MPI   seconds
 *     *        59
 *     -       129
 -     *       174
 -     -       249
Function Name     Normal   OpenMP   MPI      OpenMP+MPI
calc_intmudua     24.5 s    4.7 s   14.4 s      2.8 s
calc_hdmg_tet     16.9 s    3.0 s   10.8 s      1.7 s
calc_mudua        12.1 s    1.9 s    7.0 s      1.1 s
campo_effettivo   17.7 s    4.5 s    9.9 s      2.3 s
Actual Results
Total raw speed increment: 76% (runtime reduced from 249 s to 59 s)
• OpenMP: 6-8x
• MPI: 2x
• OpenMP + MPI: 14-16x
Conclusions
Conclusions and Future Work
• Computational time has been significantly decreased
• Speedup is consistent with expected results
• Submitted to COMPUMAG ‘09
• Continue inserting OpenMP and MPI directives
• Perform algorithm optimizations
• Increase cluster size