CRYSTAL in parallel: replicated and distributed (MPP) data
Ian Bush
Numerical Algorithms Group Ltd, HECToR CSE
Introduction
- Why parallel?
- What is in a parallel computer?
- When parallel?
- Pcrystal
- MPPcrystal
- Examples of MPPcrystal calculations
Why Parallel
- Actually a number of questions:
  1. Why do we need parallel computers?
     • Can't we just build a big, really powerful serial computer?
  2. Why should I be interested in using parallel computers?
Why Do We Need Parallel Computers ?
To avoid meltdown!
Also to keep costs down
  • Use standard components
  • Including running costs
As a result, all computers are parallel today...
Ian’s Old Laptop at Warrington, Torino, Alessandria, South Africa, India, Oxford…
• Made by Toshiba
• Paid for by ME out of my hard-earned wages
• 2 Intel x86 CPUs (Core2Duo)
• 1 Gbyte memory
  - 0.5 Gbyte per CPU
• Not in the top 500
• Reasonably typical modern machine
  - All machines are parallel today!
• Runs Window$ XP and Linux
  - The former under protest
• The screen needs cleaning
What Is In A Parallel Computer ?
The building block for modern parallel computers is the Symmetric Multiprocessor (SMP)
• A number of CPUs all share the same memory
  - Typically 2-16 CPUs
  - Sometimes called a "shared memory" computer
• Simple example: Ian's laptop
[Diagram: an SMP node – four CPUs (CPU 0 to CPU 3) all sharing a single memory.]
What Is In A Parallel Computer ?
Larger parallel computers are clusters of SMP "nodes" connected by an interconnect
• E.g. Ethernet, Myrinet, InfiniBand …
• A good interconnect is usually at least ½ the price of the whole machine
[Diagram: a cluster – two SMP nodes, the CPUs within each node sharing a memory, the nodes joined by an interconnect.]
How WE Shall View Parallel Computers
A number of CPUs each with their own memory, all connected together by an interconnect
• The presence of SMPs is usually just a small complication
• i.e. a cluster of "normal" serial computers
[Diagram: four CPUs, each with its own memory (Mem), all joined by an interconnect.]
Why Use Parallel Computers ?
- Faster time to solution
  • Jobs that take days or weeks can take hours
- More available memory
  • 32 processors means in principle 32 times more memory, so larger jobs can be run
- BUT
  • More complex: how to run, and how to run well, can vary markedly with your local set-up
When Parallel ?
Do you always want to run in parallel ?
NO!
- Assigning 10 people to a job does not necessarily get the job done 10 times quicker.
- Similarly, assigning 10 processors to a computation will not make it run 10 times quicker.
So when do you want to run in parallel?
Amdahl’s Law (or the bad news)
If running on n processors, the speed up is

  S(n) = (S + P) / (S + P/n),   with S + P = 1

where S is the serial (non-parallelizable) fraction of the work and P the parallelizable fraction.
This is a somewhat frightening equation!
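To see why, take some purely illustrative numbers: if only 10% of the work is serial (S = 0.1, P = 0.9), then on 32 processors S(32) = 1 / (0.1 + 0.9/32) ≈ 7.8 – nowhere near the ideal 32-fold speed up, and adding more processors can never push it past 1/S = 10.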
Amdahl’s Law
Gustafson’s Law (or the good news)
BUT the relative values of S and P are a function of system size
- Parallelize the more expensive parts (first)
- These typically become rapidly more expensive as the system size is increased
Parallelism is good for large systems
Load Imbalance
Say we have twenty totally independent tasks and twenty processors
- Easy to parallelize – give each task to one of the processors, but …
- What if the tasks don't all take the same time?
- The time taken will be that of the longest job
Because of load imbalance our speed up is less than perfect – we have too few tasks for too many processors
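An illustrative example: with 20 processors, if 19 tasks take 1 minute each but one takes 5 minutes, the serial time is 24 minutes while the parallel time is still 5 minutes – a speed up of only 4.8 on 20 processors.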
Don’t use too many processors for too small a job
Communications
But what if the tasks are not independent?
- The processors will need to talk to each other
- This is known as communication
- This uses the interconnect, so we need to understand that better
The Interconnect
The only difference between a parallel computer and a normal "serial" computer is the interconnect through which data is passed. The important point is that
INTERCONNECTS ARE SLOW COMPARED TO CPUS
The Interconnect
The time to transfer N bytes of data over an interconnect usually roughly obeys
t=α+βN
where
• α is the latency
  - Typically ~1-100 µs for modern networks
  - It is the time to transfer zero bytes of data
  - Dominates the time for short messages
• 1/β is the bandwidth
  - Typically ~100 MByte/s - 2 GByte/s for modern networks
  - Dominates the time for long messages
Transferring data is not doing science!
The Interconnect
t=α+βN
- α is the latency
  • Typically ~1-100 µs for modern networks
So the SHORTEST POSSIBLE time taken in using the interconnect is around 1µs
In that time a modern CPU can do roughly 2,000 floating point operations!
• To put it another way, you can do 2,000 operations that are "doing science" in the time it takes to do one that does no science at all
THE INTERCONNECT IS A NECESSARY EVIL
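To put some purely illustrative numbers on t = α + βN: with a latency of 10 µs and a bandwidth of 1 GByte/s, sending an 8 MByte message (a 1000 × 1000 matrix of double precision numbers) takes about 10 µs + 8×10⁶/10⁹ s ≈ 8 ms – time in which a CPU doing 2,000 operations per µs could have performed roughly 16 million floating point operations.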
Communications
- So communication is SLOW
- But usually the computation requirement scales more rapidly than the communication
  • e.g. matrix multiplication: work scales as N³, comms as N²
Parallelism is good for large systems
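A rough illustration of why: multiplying two N × N matrices needs about 2N³ floating point operations while only of order N² matrix elements have to be communicated, so doubling N gives 8 times more computation for only 4 times more communication – the relative cost of the interconnect shrinks as the system grows.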
Speed Up for Linear Equation Solve
[Figure: Speed Up of PDGESV on HECToR vs Number of Processors (0-300), for matrix sizes N = 5000, 10000 and 20000.]
A word on I/O
Depending on how the machine is set up, I/O on parallel machines can be VERY slow.
So in general it is best to run direct, i.e. recompute the integrals every SCF cycle rather than write them to disk.
This may not be true for medium sized jobs on machines where each processor has a fast local disk.
CRYSTAL – Parallel Implementations
- Pcrystal
  • Replicated data
  • All of CRYSTAL implemented
  • Good for medium to large problems on small to medium processor counts
- MPPcrystal
  • Distributed data
  • Much of CRYSTAL implemented
  • Good for large problems on large processor counts
CRYSTAL – basic algorithm
HR = PR . IR                 (IR = sum of independent integrals)
For each k point independently
  Hk = FT(HR)
  Hk = Qk^T Hk Qk
  Hk ψk = εk ψk
  Pk = | ψk |²
End for each k point
PR = FT(Pk)
Repeat until converged
Pcrystal - Implementation
- Standard compliant
  • Fortran 90
  • MPI for message passing
- Replicated data
  • Each processor has a complete copy of all the matrices used in the linear algebra
• Makes implementation very simple
Pcrystal – Parallel Integrals
- Coulomb, exchange and DFT terms all involve many independent tasks:
  • Coulomb/exchange have to evaluate integrals of the form <φiφj||φkφl> for all of i, j, k, l
  • Each integral is independent
  • So give a subset to each of the processors
- DFT terms are a numerical integration over a grid
  • Each point of the grid is independent
  • So give a subset of the grid to each processor
- Almost perfectly parallel!
  • Only a global sum at the end is required
  • Limit on scaling is load imbalance (see the sketch below)
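The strategy can be sketched in a few lines. This is NOT CRYSTAL's actual code (CRYSTAL is Fortran 90 + MPI); it is a minimal illustration in Python with mpi4py of the replicated-data pattern: each processor evaluates a round-robin subset of the independent tasks into its own copy of the matrix, and one global sum at the end combines the contributions. The names n_tasks, n_basis and evaluate_task are hypothetical stand-ins for the integral (or DFT grid point) work.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # which processor am I?
size = comm.Get_size()      # how many processors in total?

n_tasks = 1000              # number of independent tasks (e.g. integrals)
n_basis = 50                # dimension of the replicated matrix

def evaluate_task(task, n):
    """Hypothetical stand-in for evaluating one task's contribution."""
    rng = np.random.default_rng(task)
    return rng.standard_normal((n, n)) / n_tasks

# Each processor accumulates only its own (round-robin) subset of tasks ...
local_H = np.zeros((n_basis, n_basis))
for task in range(rank, n_tasks, size):
    local_H += evaluate_task(task, n_basis)

# ... and a single global sum leaves every processor with the full matrix.
H = comm.allreduce(local_H, op=MPI.SUM)

if rank == 0:
    print("Assembled replicated matrix of shape", H.shape)

The only communication here is the final sum, which is why this part of Pcrystal scales so well; the scaling limit is simply how evenly the tasks happen to be shared out.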
Pcrystal – Linear Algebra
- Each k point is independent
  • So each processor performs the linear algebra for a subset of the k points that the job requires
  • Again very few comms so potentially good scaling, but …
  • Potential load imbalance
  • Number of k points limits the number of processors that can be exploited
    - What if it is a Γ point only calculation?
- Limit on size of job that can be performed does not scale with number of processors
  • Memory limitations
Pcrystal – Changes to Your Input
Pcrystal – How to run
- How to run Pcrystal will be system dependent
  • Generally run in batch
  • Ask other users or consult local documentation
  • Generally will be something like mpirun -np 4 crystal
- BIG difference is that your input file must be called INPUT, and must be accessible by the processor
Pcrystal - Summary
- In general scales very well provided the number of processors ≤ number of k points
  • Will gain something due to the integrals
  • But large jobs in general require few k points
- The limit on the size of job is given by the memory required to store the linear algebra matrices for one k point
  • More processors do not mean larger jobs can be run
MPP Crystal - Implementation
- Uses common standards
  • Fortran 90
  • MPI for message passing
  • ScaLAPACK 1.7 (Dongarra et al.) for linear algebra on distributed matrices
    - www.netlib.org/scalapack/scalapack_home.html
- Distributed data
  • Each processor holds only a part of each of the matrices used in the linear algebra
- More complex to implement
MPPcrystal – Parallel Integrals
- More or less as Pcrystal
  • Works well, so why reinvent the wheel?
  • However requires replicated HR, PR
  • Ultimate limit on size of job
- However a less demanding limit than in Pcrystal, because these are stored in sparse format
- Option to distribute the DFT grid
  • Might want to turn off the bipolar expansion
MPPcrystal – Linear algebra
- As the data is distributed, comms are required to perform the linear algebra, unlike for Pcrystal
  • However there are N³ operations but only N² data to communicate
  • Scaling gets better for larger systems
- Very rough rule of thumb – a job with N basis functions can exploit up to around N/15 processors (see the example below)
- Number of processors that can be exploited is NOT limited by the number of k points
  • Great for large Γ point calculations!
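As a worked example of that rule of thumb: the Crambin calculation described below has 12354 basis functions, so roughly 12354/15 ≈ 800 processors could usefully be exploited for the linear algebra.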
MPPcrystal – other issues
- By default runs direct
  • 100s of processors writing to/reading from one disk is not a good idea!
- Most but not all of CRYSTAL is implemented
  • Will fail quickly and cleanly if a requested feature is not implemented
MPPCrystal - Changes to Your Input
TEST08 - SILICON BULK: STO-3G
CRYSTAL
0 0 0
227
5.42
1
14 .125 .125 .125
END
14 3
1 0 3 2. 0.
1 1 3 8. 0.
1 1 3 4. 0.
99 0
END
SHRINK
8 4 8
MPP
END
MPPCrystal – Other Changes to Your Input
DISTGRID
  - In the DFT input
  - Use a distributed DFT grid
  - Useful for large calculations
DCDIAG
  - In the last section
  - Use a faster but less numerically stable diagonalizer
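Purely as an illustration of where these keywords sit (the surrounding DFT/SCF block layout is assumed here, not taken from the slides, and the dots stand for the rest of the input): DISTGRID goes inside the DFT block and DCDIAG in the final SCF section, roughly like

DFT
B3LYP
DISTGRID
END
...
DCDIAG
END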
MPPcrystal – How to run
- Exactly the same comments as for Pcrystal apply:
- How to run MPPcrystal will be system dependent
  • Generally run in batch
  • Ask other users or consult local documentation
  • Generally will be something like mpirun -np 4 crystal
- Your input file must be called INPUT and must be accessible by the processor
MPPcrystal - summary
- For large systems can scale well, but not so good for small to medium sized ones.
- Size of the linear algebra matrices is, at present, not an issue given enough processors.
- The memory limitation is from the replicated HR, PR
Pcrystal and MPPcrystal
- Pcrystal
  • Few comms means it scales very well
  • However scaling is limited by the number of k points
  • Memory usage limits the size of systems that can be studied
  • Load imbalance in the linear algebra may be an issue
- MPPcrystal
  • More comms, but scales well for large systems
  • Scaling not limited by the number of k points
  • Distributing the matrices allows larger systems to be studied
MPPcrystal – Two examples
I will illustrate the behaviour of MPPcrystal with two calculations
- On a small protein, Crambin
- On a series of amorphous silica slabs
Crambin
- Small protein (46 residues)
- Crystal structure characterized to very high precision by XRD studies (0.52 Å)
- PDB entry (1EJG) includes hydrogens (this is unusual)
- 2 chains in the unit cell
- 1284 atoms
- 6-31G** basis set (12354 functions)
- All calculations B3LYP
SCF Scaling
[Figure: SCF cycles per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Integral Scaling
[Figure: integral evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Linear Algebra Scaling
[Figure: diagonalisation evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Forces Scaling
[Figure: force evaluations per hour vs Number of Processors (0-1500) for Crambin on XT3, BG/L and IBM p575.]
Electrostatic Potential
Amorphous Silica Slabs
- Inputs kindly provided by Piero Ugliengo
- 3 different sized slabs compared (the bigger two are supercells of the smallest one)
MPPCrystal Speed Up on BG/P
[Figure: MPPcrystal speed up on BG/P vs Number Of Processors (0-1200) for slabs with 7756 basis functions (579 atoms), 15512 basis functions (1158 atoms) and 23268 basis functions (1737 atoms).]
MPPCrystal Performance - 23268 Basis Functions
[Figure: SCF cycles per hour vs Number Of Processors (0-4000) for the 23268 basis function slab on BG/P, p575 and XT4.]
Summary
- CRYSTAL can use two parallelization strategies
  • Pcrystal uses replicated data
    - Good for medium to large problems
    - Memory limits the size of problem that may be addressed
    - Scales well up to the number of k points
    - The one you'll use most often
  • MPPcrystal uses distributed data
    - Memory limitations much less stringent than Pcrystal
    - For a big enough problem can scale very well