CRYSTAL in parallel: replicated and distributed (MPP) data
Ian Bush, Numerical Algorithms Group Ltd, HECToR CSE


Introduction

• Why parallel?
• What is in a parallel computer?
• When parallel?
• Pcrystal
• MPPcrystal
• Examples of MPPcrystal calculations

Why Parallel

• Actually a number of questions:
  1. Why do we need parallel computers?
     - Can’t we just build a big, really powerful serial computer?
  2. Why should I be interested in using parallel computers?

Why Do We Need Parallel Computers ?

To avoid meltdown !

Also to keep costs down
• Use standard components
• Including running costs

As a result all computers are parallel today...

Ian’s Old Laptop at Warrington, Torino, Alessandria, South Africa, India, Oxford…

• Made by Toshiba
  - Paid for by ME out of my hard earnt wages
• 2 Intel x86 CPUs (Core2Duo)
• 1 Gbyte memory
  - 0.5 Gbyte per CPU
• Not in the top 500
• Reasonably typical modern machine
  - All machines are parallel today!
• Runs Window$ XP and Linux
  - The former under protest
• The screen needs cleaning

What Is In A Parallel Computer ?

The building block for modern parallel computers is the Symmetric Multiprocessor (SMP)
• A number of CPUs all share the same memory
  - Typically 2-16 CPUs
  - Sometimes called a “shared memory” computer
• Simple example: Ian’s laptop

[Diagram: CPU 0, CPU 1, CPU 2 and CPU 3 all attached to a single shared Memory]

What Is In A Parallel Computer ?

Larger parallel computers are clusters of SMP “nodes” connected by an interconnect
• E.g. ethernet, myrinet, infiniband …
• A good interconnect is usually at least ½ the price of the whole machine

[Diagram: CPU 0 to CPU 3 grouped into SMP nodes, two Memory blocks, joined by an Interconnect]

How WE Shall View Parallel Computers

A number of CPUs each with their own memory, all connected together by an interconnect

• i.e. a cluster of “normal” serial computers
• The presence of SMPs is usually just a small complication

[Diagram: CPU 0 to CPU 3, each with its own Mem, all joined by an Interconnect]

Why Use Parallel Computers ?

• Faster time to solution
  - Jobs that take days or weeks can take hours
• More available memory
  - 32 processors means in principle 32 times more memory, so larger jobs can be run
• BUT
  - More complex: how to run, and how to run well, can vary markedly with your local set up.

When Parallel?


When Parallel ?

Do you always want to run in parallel ?

NO!
• Assigning 10 people to a job does not necessarily get the job done 10 times quicker.
• Similarly, assigning 10 processors to a computation will not make it run 10 times quicker.

So when do you want to run in parallel?


Amdahl’s Law (or the bad news)

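If a fraction S of the run time is serial and a fraction P can be run in parallel, then the speed up when running on n processors is

S(n) = (S + P) / (S + P/n), with S + P = 1

This is a somewhat frightening equation!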
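To see why the equation is frightening, it helps to put numbers into it. The following is a minimal Python sketch; the function names and the 5% serial fraction are illustrative assumptions, not figures measured for CRYSTAL.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speed up S(n) = (S + P) / (S + P/n), using S + P = 1."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def max_speedup(serial_fraction):
    """Limit of S(n) as n grows without bound, which is 1/S."""
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical job in which 5% of the run time is serial.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.05, n), 2))
    print("limit:", max_speedup(0.05))
```

With 5% serial work the speed up stays below 20 however many processors are used, because the serial part never gets any faster.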

Amdahl’s Law

Gustafson’s Law (or the good news)

BUT the relative values of S and P are a function of system size
• Parallelize the more expensive parts (first)
• These typically become rapidly more expensive as the system size is increased

Parallelism is good for large systems

Load Imbalance

Say we have twenty totally independent tasks and twenty processors
• Easy to parallelize – give each task to one of the processors, but …
• What if the tasks don’t all take the same time?
• The time taken will be that of the longest task

Because of load imbalance our speed up is less than perfect – we have too few tasks for too many processors

Don’t use too many processors for too small a job
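The effect of load imbalance can be sketched with a toy model. The Python below deals tasks out round-robin and takes the wall time as that of the busiest processor; the task lengths (one task of 5 units, nineteen of 1 unit) are made up for illustration.

```python
def parallel_time(task_times, n_processors):
    """Wall time when tasks are dealt out round-robin: the busiest processor decides."""
    loads = [0.0] * n_processors
    for i, t in enumerate(task_times):
        loads[i % n_processors] += t
    return max(loads)


def speedup(task_times, n_processors):
    """Serial time divided by parallel wall time."""
    return sum(task_times) / parallel_time(task_times, n_processors)


if __name__ == "__main__":
    # Hypothetical workload: twenty independent tasks, one much longer than the rest.
    tasks = [5.0] + [1.0] * 19
    for p in (4, 20):
        s = speedup(tasks, p)
        print(p, "processors: speed up", round(s, 2), "efficiency", round(s / p, 2))
```

In this toy case 20 processors are used far less efficiently than 4, since most of them sit idle while the one long task finishes.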

Communications

But what if the tasks are not independent?
• The processors will need to talk to each other
• This is known as communication
• This uses the interconnect, so we need to understand that better

The Interconnect

The only difference between a parallel computer and normal “serial” computers is the interconnect through which data is passed. The important point is that

INTERCONNECTS ARE SLOW COMPARED TO CPUS

The Interconnect

The time to transfer N bytes of data over an interconnect usually roughly obeys

t = α + βN

where
• α is the latency
  - Typically ~1-100 µs for modern networks
  - It is the time to transfer zero bytes of data
  - Dominates the time for short messages
• 1/β is the bandwidth
  - Typically ~100 MByte/s to 2 GByte/s for modern networks
  - Dominates the time for long messages

Transferring data is not doing science!
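A small Python sketch of the model t = α + βN shows where each term takes over. The function names are illustrative, and the 5 µs latency and 1 GByte/s bandwidth are assumed values picked from within the ranges quoted above, not measurements of any particular network.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """t = alpha + beta * N, with alpha = latency and beta = 1 / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def crossover_bytes(latency_s, bandwidth_bytes_per_s):
    """Message size alpha / beta at which the latency and bandwidth terms are equal."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    latency = 5.0e-6      # assumed: 5 microseconds
    bandwidth = 1.0e9     # assumed: 1 GByte/s
    print("crossover size (bytes):", crossover_bytes(latency, bandwidth))
    for n_bytes in (8, 10**3, 10**6):
        print(n_bytes, "bytes take", transfer_time(n_bytes, latency, bandwidth), "s")
```

Messages well below the crossover size cost essentially the latency; messages well above it cost essentially N divided by the bandwidth.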

The Interconnect

t = α + βN

• α is the latency
  - Typically ~1-100 µs for modern networks

So the SHORTEST POSSIBLE time taken in using the interconnect is around 1 µs

In that time a modern CPU can do roughly 2,000 floating point operations!
• To put it another way, you can do 2,000 operations that are “doing science” in the time it takes to do 1 that does no science at all

THE INTERCONNECT IS A NECESSARY EVIL
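The same arithmetic can be repeated across the quoted latency range. This Python sketch simply scales the rate given above (roughly 2,000 floating point operations per microsecond); the function name is illustrative and the rate is the talk's rough figure, not a benchmark.

```python
FLOPS_PER_MICROSECOND = 2000  # rough rate quoted above: ~2,000 operations in 1 µs


def flops_forgone(latency_us):
    """Operations a CPU could have done while one message's latency elapses."""
    return latency_us * FLOPS_PER_MICROSECOND


if __name__ == "__main__":
    # Latencies spanning the ~1-100 µs range quoted for modern networks.
    for latency_us in (1, 10, 100):
        print(latency_us, "µs of latency costs about", flops_forgone(latency_us), "flops")
```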

Communications

• So communication is SLOW
• But usually the computation requirement scales more rapidly than the communication
  - e.g. matrix multiplication: work scales as N^3, comms as N^2

Parallelism is good for large systems
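A toy cost model makes the N^3 against N^2 argument concrete. Everything in this Python sketch is an assumption made for illustration: the work is shared perfectly, each processor moves N^2 double precision words, latency is ignored, and the rates (2 GFlop/s, 1 GByte/s) are round numbers in line with those used earlier.

```python
def compute_time(n, n_processors, flops_per_s=2.0e9):
    """Time for the N^3 work, assumed to be shared perfectly among the processors."""
    return n ** 3 / (n_processors * flops_per_s)


def comms_time(n, bandwidth_bytes_per_s=1.0e9, bytes_per_word=8):
    """Time to move N^2 words; latency is ignored in this toy model."""
    return n ** 2 * bytes_per_word / bandwidth_bytes_per_s


def comms_fraction(n, n_processors):
    """Fraction of the total time spent communicating rather than computing."""
    t_comp = compute_time(n, n_processors)
    t_comm = comms_time(n)
    return t_comm / (t_comp + t_comm)


if __name__ == "__main__":
    for n in (1000, 5000, 10000, 20000):
        print(n, round(comms_fraction(n, 64), 3))
```

On 64 processors the modelled communication share falls from about half the run time at N=1000 to under a tenth at N=10000, which is the sense in which parallelism is good for large systems.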

Speed Up for Linear Equation Solve

[Figure: Speed Up of PDGESV on HECToR. Speed up (0 to 80) against number of processors (0 to 300) for N=5000, N=10000 and N=20000.]

A word on I/O

Depending on how the machine is set up I/O on parallel machines can be VERY slow.

So in general it is best to run direct, i.e. recompute the integrals as they are needed rather than storing them on disk.

This may not be true for medium sized jobs on machines where each processor has a fast local disk.

CRYSTAL – Parallel Implementations

• Pcrystal
  - Replicated data
  - All of CRYSTAL implemented
  - Good for medium to large problems on small to medium processor counts
• MPPcrystal
  - Distributed data
  - Much of CRYSTAL implemented
  - Good for large problems on large processor counts

CRYSTAL – basic algorithm

HR = PR . IR        (IR = sum of independent integrals)
For each k point independently
    Hk = FT(HR)
    Hk = Qk^T Hk Qk
    Hk ψk = εk ψk
    Pk = |ψk|^2
End for each k point
PR = FT(Pk)
Repeat until converged

Pcrystal - Implementation

• Standard compliant
  - Fortran 90
  - MPI for message passing
• Replicated data
  - Each processor has a complete copy of all the matrices used in the linear algebra
  - Makes implementation very simple

Pcrystal – Parallel Integrals

• Coulomb, Exchange and DFT terms all involve many independent tasks:
  - Coulomb/Exchange have to evaluate integrals of the form <φiφj||φkφl> for all of i, j, k, l
  - Each integral independent
  - So give a subset to each of the processors
• DFT terms are a numerical integration over a grid
  - Each point of the grid independent
  - So give a subset of the grid to each processor
• Almost perfectly parallel!
  - Only global sum at end required
  - Limit on scaling is load imbalance
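The pattern described above (deal out independent tasks, then do one global sum) can be sketched in a few lines. This is not CRYSTAL's code: the Python below simulates the processors ("ranks") with an ordinary loop instead of MPI, uses a round-robin assignment as one possible way to pick the subsets, and `toy_integral` is a hypothetical stand-in for a real two-electron integral.

```python
from itertools import product


def toy_integral(i, j, k, l):
    """Hypothetical stand-in for an expensive integral over four basis functions."""
    return 1.0 / (1 + i + j + k + l)


def partial_sum(rank, n_ranks, n_functions):
    """Sum of the integrals dealt to one rank by round-robin assignment."""
    total = 0.0
    all_quartets = product(range(n_functions), repeat=4)
    for task_id, (i, j, k, l) in enumerate(all_quartets):
        if task_id % n_ranks == rank:
            total += toy_integral(i, j, k, l)
    return total


def global_sum(n_ranks, n_functions):
    """Stand-in for the single reduction at the end (in MPI, e.g. MPI_Allreduce)."""
    return sum(partial_sum(rank, n_ranks, n_functions) for rank in range(n_ranks))


if __name__ == "__main__":
    print("1 rank :", global_sum(1, 5))
    print("4 ranks:", global_sum(4, 5))
```

The two totals agree up to floating point rounding, and no rank needs anything from another rank until the final sum.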

Pcrystal – Linear Algebra

• Each k point independent
  - So each processor performs the linear algebra for a subset of the k points that the job requires
  - Again very few comms so potentially good scaling, but …
  - Potential load imbalance
  - Number of k points limits the number of processors that can be exploited
  - What if it is a Γ point only calculation?
• Limit on size of job that can be performed does not scale with number of processors
  - Memory limitations
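A short Python sketch shows both problems with distributing over k points. The round-robin assignment and the function names are assumptions for illustration; the actual distribution used by Pcrystal is not described here.

```python
def assign_k_points(n_k_points, n_processors):
    """Deal k points out round-robin; returns one list of k-point indices per processor."""
    assignment = [[] for _ in range(n_processors)]
    for k in range(n_k_points):
        assignment[k % n_processors].append(k)
    return assignment


def busy_processors(n_k_points, n_processors):
    """Number of processors that receive at least one k point."""
    return sum(1 for ks in assign_k_points(n_k_points, n_processors) if ks)


if __name__ == "__main__":
    # Load imbalance: 10 k points do not divide evenly over 4 processors.
    print(assign_k_points(10, 4))
    # Gamma point only: one k point, so only one of 16 processors has work.
    print(busy_processors(1, 16), "of 16 processors busy")
```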

Pcrystal – Changes to Your Input


Pcrystal – How to run

• How to run Pcrystal will be system dependent
  - Generally run in batch
  - Ask other users or consult local documentation
  - Generally will be something like mpirun -np 4 crystal
• BIG difference is that your input file must be called INPUT, and must be accessible by the processor

Pcrystal - Summary

• In general scales very well provided the number of processors ≤ number of k points
  - Will gain something due to integrals
  - But large jobs in general require few k points
• The limit on the size of job is given by the memory required to store the linear algebra matrices for one k point
  - More processors do not mean larger jobs can be run

MPP Crystal - Implementation

• Uses common standards
  - Fortran 90
  - MPI for message passing
  - ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices
    www.netlib.org/scalapack/scalapack_home.html
• Distributed data
  - Each processor holds only a part of each of the matrices used in the linear algebra
• More complex to implement
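The memory argument for distributing the matrices can be put into numbers. The Python sketch below assumes one dense, real, double precision N x N matrix and an even split, and ignores workspace, complex matrices and the padding of a real ScaLAPACK block-cyclic layout, so treat it as an order-of-magnitude estimate only. N = 12354 is the Crambin basis set size quoted later in this talk.

```python
BYTES_PER_DOUBLE = 8


def replicated_bytes_per_processor(n_basis):
    """One dense real N x N matrix, with a full copy held on every processor."""
    return n_basis * n_basis * BYTES_PER_DOUBLE


def distributed_bytes_per_processor(n_basis, n_processors):
    """The same matrix split evenly over the processors (rounded up)."""
    total = n_basis * n_basis * BYTES_PER_DOUBLE
    return -(-total // n_processors)  # ceiling division


if __name__ == "__main__":
    n = 12354
    print("replicated :", replicated_bytes_per_processor(n), "bytes per processor")
    for p in (64, 256, 1024):
        print("distributed over", p, ":", distributed_bytes_per_processor(n, p), "bytes per processor")
```

Replicated, each matrix costs every processor about 1.2 GByte however many processors are used; distributed, the per-processor share shrinks as processors are added.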

MPPcrystal – Parallel Integrals

• More or less as Pcrystal
  - Works well so why reinvent the wheel?
  - However requires replicated HR, PR
  - Ultimate limit on size of job
• However less demanding limit than Pcrystal because these are stored in sparse format
• Option to distribute DFT grid
• Might want to turn off Bipolar expansion

MPPcrystal – Linear algebra

• As the data are distributed, comms are required to perform the linear algebra, unlike for Pcrystal
  - However N^3 operations but only N^2 data to communicate
  - Scaling gets better for larger systems
• Very rough rule of thumb – with N basis functions can exploit up to around N/15 processors
• Number of processors that can be exploited is NOT limited by the number of k points
  - Great for large Γ point calculations!
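The N/15 rule of thumb is easy to apply to the systems used as examples later in this talk. The Python helper below is only a restatement of that very rough rule; the function name and the rounding down are choices made for illustration.

```python
def max_useful_processors(n_basis, functions_per_processor=15):
    """Very rough upper bound on exploitable processors from the N/15 rule of thumb."""
    return n_basis // functions_per_processor


if __name__ == "__main__":
    # Basis set sizes quoted for the Crambin and amorphous silica examples.
    for n_basis in (7756, 12354, 15512, 23268):
        print(n_basis, "basis functions -> up to about", max_useful_processors(n_basis), "processors")
```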

MPPcrystal – other issues

• By default runs direct
  - 100s of processors writing to/reading from one disk is not a good idea!
• Most but not all of CRYSTAL implemented
  - Will fail quickly and cleanly if requested feature not implemented

MPPCrystal - Changes to Your Input

    TEST08 - SILICON BULK: STO-3G
    CRYSTAL
    0 0 0
    227
    5.42
    1
    14 .125 .125 .125
    END
    14 3
    1 0 3 2. 0.
    1 1 3 8. 0.
    1 1 3 4. 0.
    99 0
    END
    SHRINK
    8 4 8
    MPP
    END

MPPCrystal – Other Changes to Your Input

DISTGRID
• In the DFT input
• Use a distributed DFT grid
• Useful for large calculations

DCDIAG
• In the last section
• Use a faster but less numerically stable diagonalizer

MPPcrystal – How to run

• Exactly the same comments as Pcrystal apply:
• How to run MPPcrystal will be system dependent
  - Generally run in batch
  - Ask other users or consult local documentation
  - Generally will be something like mpirun -np 4 crystal
• Your input file must be called INPUT and must be accessible by the processor

MPPcrystal - summary

• For large systems can scale well, but not so good for small to medium size ones.
• Size of linear algebra matrices is, at present, not an issue given enough processors.
• Memory limitation is from the replicated HR, PR

Pcrystal and MPPcrystal

• Pcrystal
  - Few comms means scales very well
  - However scaling limited by number of k points
  - Memory usage limits size of systems that can be studied
  - Load imbalance in linear algebra may be an issue
• MPPcrystal
  - More comms but scales well for large systems
  - Scaling not limited by number of k points
  - Distributing the matrices allows larger systems to be studied

MPPcrystal – Two examples

I will illustrate the behaviour of MPPcrystal with two calculations
• On a small protein, Crambin
• On a series of amorphous silica slabs

Crambin

• Small protein (46 residues)
• Crystal structure characterized to very high precision by XRD studies (0.52 Å)
• PDB entry (1EJG) includes hydrogens (this is unusual)

2 Chains in unit Cell

1284 Atoms

6-31G** basis set (12354 functions)

All calculations B3LYP

SCF Scaling

[Figure: SCF. Cycles per hour (0 to 12) against number of processors (0 to 1500) for XT3, BG/L and IBM p575.]

Integral Scaling

[Figure: Integrals. Evaluations/hour (0 to 30) against number of processors (0 to 1500) for XT3, BG/L and IBM p575.]

Linear Algebra Scaling

[Figure: Diagonalisation. Evaluations/hour (0 to 50) against number of processors (0 to 1500) for XT3, BG/L and IBM p575.]

Forces Scaling

[Figure: Forces. Evaluations/hour (0 to 3.5) against number of processors (0 to 1500) for XT3, BG/L and IBM p575.]

Electrostatic Potential


Amorphous Silica Slabs

• Inputs kindly provided by Piero Ugliengo
• 3 different sizes of slab compared (the bigger two are supercells of the smallest one)

MPPCrystal Speed Up on BG/P

[Figure: Speed up (0 to 1000) against number of processors (0 to 1200) for three slabs: 7756 basis functions (579 atoms), 15512 basis functions (1158 atoms) and 23268 basis functions (1737 atoms).]

MPPCrystal Performance - 23268 Basis Functions

[Figure: SCF cycles per hour (0 to 35) against number of processors (0 to 4000) for BG/P, p575 and XT4.]

Summary

• Crystal can use two parallelization strategies
  - Pcrystal uses replicated data
      - Good for medium to large problems
      - Memory limits size of problem that may be addressed
      - Scales well up to number of k points
      - The one you’ll use most often
  - MPPcrystal uses distributed data
      - Memory limitations much less stringent than Pcrystal
      - For a big enough problem can scale very well