CRYSTAL in parallel: replicated and distributed (MPP) data
Ian Bush
Numerical Algorithms Group Ltd, HECToR CSE
Introduction
- Why parallel?
- What is in a parallel computer?
- When parallel?
- Pcrystal
- MPPcrystal
- Examples of MPPcrystal calculations
Why Parallel
- Actually a number of questions:
  1. Why do we need parallel computers?
     • Can't we just build a big, really powerful serial computer?
  2. Why should I be interested in using parallel computers?
Why Do We Need Parallel Computers ?
To avoid meltdown!
Also to keep costs down
  • Use standard components
  • Including running costs
As a result, all computers are parallel today...
Ian’s Old Laptop at Warrington, Torino, Alessandria, South Africa, India, Oxford…
• Made by Toshiba
• Paid for by ME out of my hard-earned wages
• 2 Intel x86 CPUs (Core2Duo)
• 1 Gbyte memory
  - 0.5 Gbyte per CPU
• Not in the top 500
• Reasonably typical modern machine
  - All machines are parallel today!
• Runs Window$ XP and Linux
  - The former under protest
• The screen needs cleaning
What Is In A Parallel Computer ?
The building block for modern parallel computers is the Symmetric Multiprocessor (SMP)
• A number of CPUs all share the same memory
  - Typically 2-16 CPUs
  - Sometimes called a "shared memory" computer
• Simple example: Ian's laptop
[Diagram: an SMP node – four CPUs (CPU 0 to CPU 3) all sharing a single memory.]
What Is In A Parallel Computer ?
Larger parallel computers are clusters of SMP "nodes" connected by an interconnect
• E.g. Ethernet, Myrinet, InfiniBand …
• A good interconnect is usually at least ½ the price of the whole machine
[Diagram: a cluster – two SMP nodes, the CPUs within each node sharing a memory, the nodes joined by an interconnect.]
How WE Shall View Parallel Computers
A number of CPUs each with their own memory, all connected together by an interconnect
• The presence of SMPs is usually just a small complication
• i.e. a cluster of "normal" serial computers
[Diagram: four CPUs, each with its own memory (Mem), all joined by an interconnect.]
Why Use Parallel Computers ?
- Faster time to solution
  • Jobs that take days or weeks can take hours
- More available memory
  • 32 processors means in principle 32 times more memory, so larger jobs can be run
- BUT
  • More complex: how to run, and how to run well, can vary markedly with your local set-up
When Parallel ?
Do you always want to run in parallel ?
NO!
- Assigning 10 people to a job does not necessarily get the job done 10 times quicker.
- Similarly, assigning 10 processors to a computation will not make it run 10 times quicker.
So when do you want to run in parallel?
Amdahl’s Law (or the bad news)
If running on n processors, the speed up is

  S(n) = (S + P) / (S + P/n),   with S + P = 1

where S is the serial (non-parallelizable) fraction of the work and P the parallelizable fraction.
This is a somewhat frightening equation!
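To see why, take some purely illustrative numbers: if only 10% of the work is serial (S = 0.1, P = 0.9), then on 32 processors S(32) = 1 / (0.1 + 0.9/32) ≈ 7.8 – nowhere near the ideal 32-fold speed up, and adding more processors can never push it past 1/S = 10.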
Amdahl’s Law
Gustafson’s Law (or the good news)
BUT the relative values of S and P are a function of system size
- Parallelize the more expensive parts (first)
- These typically become rapidly more expensive as the system size is increased
Parallelism is good for large systems
Load Imbalance
Say we have twenty totally independent tasks and twenty processors
- Easy to parallelize – give each task to one of the processors, but …
- What if the tasks don't all take the same time?
- The time taken will be that of the longest job
Because of load imbalance our speed up is less than perfect – we have too few tasks for too many processors
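An illustrative example: with 20 processors, if 19 tasks take 1 minute each but one takes 5 minutes, the serial time is 24 minutes while the parallel time is still 5 minutes – a speed up of only 4.8 on 20 processors.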
Don’t use too many processors for too small a job
Communications
But what if the tasks are not independent?
- The processors will need to talk to each other
- This is known as communication
- This uses the interconnect, so we need to understand that better
The Interconnect
The only difference between a parallel computer and a normal "serial" computer is the interconnect through which data is passed. The important point is that
INTERCONNECTS ARE SLOW COMPARED TO CPUS
The Interconnect
The time to transfer N bytes of data over an interconnect usually roughly obeys
t=α+βN
where
• α is the latency
  - Typically ~1-100 µs for modern networks
  - It is the time to transfer zero bytes of data
  - Dominates the time for short messages
• 1/β is the bandwidth
  - Typically ~100 MByte/s - 2 GByte/s for modern networks
  - Dominates the time for long messages
Transferring data is not doing science!
The Interconnect
t=α+βN
- α is the latency
  • Typically ~1-100 µs for modern networks
So the SHORTEST POSSIBLE time taken in using the interconnect is around 1µs
In that time a modern CPU can do roughly 2,000 floating point operations!
• To put it another way, you can do 2,000 operations that are "doing science" in the time it takes to do one that does no science at all
THE INTERCONNECT IS A NECESSARY EVIL
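To put some purely illustrative numbers on t = α + βN: with a latency of 10 µs and a bandwidth of 1 GByte/s, sending an 8 MByte message (a 1000 × 1000 matrix of double precision numbers) takes about 10 µs + 8×10⁶/10⁹ s ≈ 8 ms – time in which a CPU doing 2,000 operations per µs could have performed roughly 16 million floating point operations.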
Communications
- So communication is SLOW
- But usually the computation requirement scales more rapidly than the communication
  • e.g. matrix multiplication: work scales as N³, comms as N²
Parallelism is good for large systems
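A rough illustration of why: multiplying two N × N matrices needs about 2N³ floating point operations while only of order N² matrix elements have to be communicated, so doubling N gives 8 times more computation for only 4 times more communication – the relative cost of the interconnect shrinks as the system grows.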
Speed Up for Linear Equation Solve
[Figure: Speed Up of PDGESV on HECToR vs Number of Processors (0-300), for matrix sizes N = 5000, 10000 and 20000.]
A word on I/O
Depending on how the machine is set up, I/O on parallel machines can be VERY slow.
So in general it is best to run direct, i.e. recompute the integrals every SCF cycle rather than write them to disk.
This may not be true for medium sized jobs on machines where each processor has a fast local disk.
CRYSTAL – Parallel Implementations
- Pcrystal
  • Replicated data
  • All of CRYSTAL implemented
  • Good for medium to large problems on small to medium processor counts
- MPPcrystal
  • Distributed data
  • Much of CRYSTAL implemented
  • Good for large problems on large processor counts
CRYSTAL – basic algorithm
HR = PR . IR                 (IR = sum of independent integrals)
For each k point independently
  Hk = FT(HR)
  Hk = Qk^T Hk Qk
  Hk ψk = εk ψk
  Pk = | ψk |²
End for each k point
PR = FT(Pk)
Repeat until converged
Pcrystal - Implementation
- Standard compliant
  • Fortran 90
  • MPI for message passing
- Replicated data
  • Each processor has a complete copy of all the matrices used in the linear algebra
• Makes implementation very simple
Pcrystal – Parallel Integrals
- Coulomb, exchange and DFT terms all involve many independent tasks:
  • Coulomb/exchange have to evaluate integrals of the form <φiφj||φkφl> for all of i, j, k, l
  • Each integral is independent
  • So give a subset to each of the processors
- DFT terms are a numerical integration over a grid
  • Each point of the grid is independent
  • So give a subset of the grid to each processor
- Almost perfectly parallel!
  • Only a global sum at the end is required
  • Limit on scaling is load imbalance (see the sketch below)
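The strategy can be sketched in a few lines. This is NOT CRYSTAL's actual code (CRYSTAL is Fortran 90 + MPI); it is a minimal illustration in Python with mpi4py of the replicated-data pattern: each processor evaluates a round-robin subset of the independent tasks into its own copy of the matrix, and one global sum at the end combines the contributions. The names n_tasks, n_basis and evaluate_task are hypothetical stand-ins for the integral (or DFT grid point) work.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # which processor am I?
size = comm.Get_size()      # how many processors in total?

n_tasks = 1000              # number of independent tasks (e.g. integrals)
n_basis = 50                # dimension of the replicated matrix

def evaluate_task(task, n):
    """Hypothetical stand-in for evaluating one task's contribution."""
    rng = np.random.default_rng(task)
    return rng.standard_normal((n, n)) / n_tasks

# Each processor accumulates only its own (round-robin) subset of tasks ...
local_H = np.zeros((n_basis, n_basis))
for task in range(rank, n_tasks, size):
    local_H += evaluate_task(task, n_basis)

# ... and a single global sum leaves every processor with the full matrix.
H = comm.allreduce(local_H, op=MPI.SUM)

if rank == 0:
    print("Assembled replicated matrix of shape", H.shape)

The only communication here is the final sum, which is why this part of Pcrystal scales so well; the scaling limit is simply how evenly the tasks happen to be shared out.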
Pcrystal – Linear Algebra
- Each k point is independent
  • So each processor performs the linear algebra for a subset of the k points that the job requires
  • Again very few comms so potentially good scaling, but …
  • Potential load imbalance
  • Number of k points limits the number of processors that can be exploited
    - What if it is a Γ point only calculation?
- Limit on size of job that can be performed does not scale with number of processors
  • Memory limitations
Pcrystal – Changes to Your Input
Pcrystal – How to run
- How to run Pcrystal will be system dependent
  • Generally run in batch
  • Ask other users or consult local documentation
  • Generally will be something like mpirun -np 4 crystal
- BIG difference is that your input file must be called INPUT, and must be accessible by the processor
Pcrystal - Summary
- In general scales very well provided the number of processors ≤ number of k points
  • Will gain something due to the integrals
  • But large jobs in general require few k points
- The limit on the size of job is given by the memory required to store the linear algebra matrices for one k point
  • More processors do not mean larger jobs can be run
MPP Crystal - Implementation
- Uses common standards
  • Fortran 90
  • MPI for message passing
  • ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices
    - www.netlib.org/scalapack/scalapack_home.html
- Distributed data
  • Each processor holds only a part of each of the matrices used in the linear algebra
- More complex to implement
MPPcrystal – Parallel Integrals
- More or less as Pcrystal
  • Works well, so why reinvent the wheel?
  • However requires replicated HR, PR
  • Ultimate limit on size of job
- However a less demanding limit than in Pcrystal, because these are stored in sparse format
- Option to distribute the DFT grid
  • Might want to turn off the bipolar expansion
MPPcrystal – Linear algebra
- As the data is distributed, comms are required to perform the linear algebra, unlike for Pcrystal
  • However there are N³ operations but only N² data to communicate
  • Scaling gets better for larger systems
- Very rough rule of thumb – a job with N basis functions can exploit up to around N/15 processors (see the example below)
- Number of processors that can be exploited is NOT limited by the number of k points
  • Great for large Γ point calculations!
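As a worked example of that rule of thumb: the Crambin calculation described below has 12354 basis functions, so roughly 12354/15 ≈ 800 processors could usefully be exploited for the linear algebra.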
MPPcrystal – other issues
- By default runs direct
  • 100s of processors writing to/reading from one disk is not a good idea!
- Most but not all of CRYSTAL is implemented
  • Will fail quickly and cleanly if a requested feature is not implemented
MPPCrystal - Changes to Your Input
TEST08 - SILICON BULK: STO-3G
CRYSTAL
0 0 0
227
5.42
1
14 .125 .125 .125
END
14 3
1 0 3 2. 0.
1 1 3 8. 0.
1 1 3 4. 0.
99 0
END
SHRINK
8 4 8
MPP
END
MPPCrystal – Other Changes to Your Input
DISTGRID
  - In the DFT input
  - Use a distributed DFT grid
  - Useful for large calculations
DCDIAG
  - In the last section
  - Use a faster but less numerically stable diagonalizer
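Purely as an illustration of where these keywords sit (the surrounding DFT/SCF block layout is assumed here, not taken from the slides, and the dots stand for the rest of the input): DISTGRID goes inside the DFT block and DCDIAG in the final SCF section, roughly like

DFT
B3LYP
DISTGRID
END
...
DCDIAG
END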
MPPcrystal – How to run
- Exactly the same comments as for Pcrystal apply:
- How to run MPPcrystal will be system dependent
  • Generally run in batch
  • Ask other users or consult local documentation
  • Generally will be something like mpirun -np 4 crystal
- Your input file must be called INPUT and must be accessible by the processor
MPPcrystal - summary
- For large systems can scale well, but not so good for small to medium sized ones.
- Size of the linear algebra matrices is, at present, not an issue given enough processors.
- The memory limitation is from the replicated HR, PR
Pcrystal and MPPcrystal
- Pcrystal
  • Few comms means it scales very well
  • However scaling is limited by the number of k points
  • Memory usage limits the size of systems that can be studied
  • Load imbalance in the linear algebra may be an issue
- MPPcrystal
  • More comms, but scales well for large systems
  • Scaling not limited by the number of k points
  • Distributing the matrices allows larger systems to be studied
MPPcrystal – Two examples
I will illustrate the behaviour of MPPcrystal with two calculations
- On a small protein, Crambin
- On a series of amorphous silica slabs
Crambin
- Small protein (46 residues)
- Crystal structure characterized to very high precision by XRD studies (0.52 Å)
- PDB entry (1EJG) includes hydrogens (this is unusual)
- 2 chains in the unit cell
- 1284 atoms
- 6-31G** basis set (12354 functions)
- All calculations B3LYP
SCF Scaling
[Figure: SCF cycles per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Integral Scaling
[Figure: integral evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Linear Algebra Scaling
[Figure: diagonalisation evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Forces Scaling
[Figure: force evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Electrostatic Potential
Amorphous Silica Slabs
- Inputs kindly provided by Piero Ugliengo
- 3 different sized slabs compared (the bigger two are supercells of the smallest one)
MPPCrystal Speed Up on BG/P
[Figure: MPPcrystal speed up on BG/P vs Number Of Processors (0-1200) for slabs with 7756 basis functions (579 atoms), 15512 basis functions (1158 atoms) and 23268 basis functions (1737 atoms).]
MPPCrystal Performance - 23268 Basis Functions
[Figure: SCF cycles per hour vs Number Of Processors (0-4000) for the 23268 basis function slab on BG/P, p575 and XT4.]
Summary
- CRYSTAL can use two parallelization strategies
  • Pcrystal uses replicated data
    - Good for medium to large problems
    - Memory limits the size of problem that may be addressed
    - Scales well up to the number of k points
    - The one you'll use most often
  • MPPcrystal uses distributed data
    - Memory limitations much less stringent than Pcrystal
    - For a big enough problem can scale very well