Introduction to Parallel Processing
Lecture No. 10
24/12/2001
Home Assignment No. 3
• Can be submitted until Thursday, 27/12/2001
Final Projects
• Groups 1-10 are asked to prepare their presentations for the lecture in two weeks.
• Please send the presentation files in PowerPoint format before the lecture, or come to class with a burned CD-ROM.
The Quizzes
• Grading of the quizzes will be completed by Friday.
• The results will be published in the next lecture.
Lecture Topics
• Today's topics:
– Shared Memory
– Cilk, OpenMP
– MPI – Derived Data Types
– How to Build a Beowulf
Shared Memory
• Go to the PDF presentation:
Chapter 8 from Wilkinson & Allen's book,
"Programming with Shared Memory"
Summary
• Process creation
• The thread concept
• Pthread routines
• How data can be created as shared
• Condition Variables
• Dependency analysis: Bernstein’s conditions
Cilk
http://supertech.lcs.mit.edu/cilk
Cilk
• A language for multithreaded parallel programming based on ANSI C.
• Cilk is designed as a general-purpose parallel programming language
• Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.
A serial C program to compute the nth Fibonacci number.
A parallel Cilk program to compute the nth Fibonacci number.
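The two programs referenced above are not reproduced in this transcript; a sketch following the canonical fib example from the Cilk documentation (two separate programs, shown together for comparison):

```cilk
/* Serial C version */
int fib(int n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

/* Parallel Cilk version: spawn forks a child task, sync waits for all of them */
cilk int fib(int n) {
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return x + y;
    }
}
```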
Cilk - continue
• Compiling: $ cilk -O2 fib.cilk -o fib
• Executing: $ fib --nproc 4 30
OpenMP
Next 5 slides taken from the SC99 tutorial given by Tim Mattson (Intel Corporation) and Rudolf Eigenmann (Purdue University).
Further Reading
High-Performance Computing
Part III
Shared Memory Parallel Processors
Back to MPI
Collective Communication
Broadcast
Collective Communication
Reduce
Collective Communication
Gather
Collective Communication
Allgather
Collective Communication
Scatter
Collective Communication
There are more collective communication commands…
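The broadcast/reduce/gather/scatter figures are not reproduced in this transcript; as a sketch of how two of these collectives look in code (standard MPI calls; the buffer names are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Broadcast: the root (rank 0) sends the same value to every process */
    int n = (rank == 0) ? 100 : 0;
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Reduce: combine one value from each process into a sum at the root */
    int local = rank + 1, total = 0;
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("n=%d, sum of 1..%d = %d\n", n, nprocs, total);

    MPI_Finalize();
    return 0;
}
```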
• MPI – Derived Data Types
• MPI-2 – Parallel I/O
Advanced Topics in MPI
User Defined Types
• Besides the predefined types, the user can create new types
• Compact pack/unpack.
Predefined Types
MPI_DOUBLE double
MPI_FLOAT float
MPI_INT signed int
MPI_LONG signed long int
MPI_LONG_DOUBLE long double
MPI_LONG_LONG_INT signed long long int
MPI_SHORT signed short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_LONG unsigned long int
MPI_UNSIGNED_SHORT unsigned short int
MPI_BYTE (no corresponding C type)
Motivation
•What if you want to specify:
•non-contiguous data of a single type?
•contiguous data of mixed types?
•non-contiguous data of mixed types?
Derived datatypes save memory; they are faster, more portable, and more elegant than sending the pieces one by one.
3 Steps
1. Construct the new datatype using appropriate MPI routines: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed
2. Commit the new datatype: MPI_Type_commit
3. Use the new datatype in sends/receives, etc.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Status status;
    struct { int x; int y; int z; } point;
    MPI_Datatype ptype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_contiguous(3, MPI_INT, &ptype);
    MPI_Type_commit(&ptype);
    if (rank == 3) {
        point.x = 15; point.y = 23; point.z = 6;
        MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
        printf("P:%d received coords are (%d,%d,%d)\n",
               rank, point.x, point.y, point.z);
    }
    MPI_Finalize();
    return 0;
}
User Defined Types
• MPI_TYPE_STRUCT• MPI_TYPE_CONTIGUOUS• MPI_TYPE_VECTOR• MPI_TYPE_HVECTOR• MPI_TYPE_INDEXED• MPI_TYPE_HINDEXED
MPI_TYPE_STRUCT
is the most general way to construct an MPI derived type because it allows the length, location, and type of each component to be specified independently.
int MPI_Type_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
Struct Datatype Example
count = 2
array_of_blocklengths[0] = 1
array_of_types[0] = MPI_INT
array_of_blocklengths[1] = 3
array_of_types[1] = MPI_DOUBLE
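A sketch completing the example above into a working constructor (the struct layout and the helper name are assumptions; displacements are obtained with MPI_Get_address, and MPI_Type_create_struct is the modern name of MPI_Type_struct):

```c
#include <mpi.h>

/* One int followed by three doubles, matching the example above */
struct particle { int id; double coords[3]; };

/* Build an MPI datatype describing struct particle (call after MPI_Init) */
MPI_Datatype make_particle_type(void) {
    struct particle p;
    int blocklengths[2] = {1, 3};
    MPI_Datatype types[2] = {MPI_INT, MPI_DOUBLE};
    MPI_Aint base, displacements[2];
    MPI_Datatype newtype;

    /* Displacements are measured in bytes from the start of the struct */
    MPI_Get_address(&p, &base);
    MPI_Get_address(&p.id, &displacements[0]);
    MPI_Get_address(&p.coords, &displacements[1]);
    displacements[0] -= base;
    displacements[1] -= base;

    MPI_Type_create_struct(2, blocklengths, displacements, types, &newtype);
    MPI_Type_commit(&newtype);
    return newtype;
}
```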
MPI_TYPE_CONTIGUOUS
is the simplest of these, describing a contiguous sequence of values in memory.
For example,
MPI_Type_contiguous(2,MPI_DOUBLE,&MPI_2D_POINT);
MPI_Type_contiguous(3,MPI_DOUBLE,&MPI_3D_POINT);

int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS
creates new type indicators MPI_2D_POINT and MPI_3D_POINT. These type indicators allow you to treat consecutive pairs of doubles as point coordinates in a 2-dimensional space and sequences of three doubles as point coordinates in a 3-dimensional space.
MPI_TYPE_VECTOR
describes several such sequences evenly spaced but not consecutive in memory.
MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR except that the distance between successive blocks is specified in bytes rather than elements.
MPI_TYPE_INDEXED describes sequences that may vary both in length and in spacing.
MPI_TYPE_VECTOR
count = 2, blocklength = 3, stride = 5
int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
Example program:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int rank, i, j;
    MPI_Status status;
    double x[4][8];
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);
    if (rank == 3) {
        for (i = 0; i < 4; ++i)
            for (j = 0; j < 8; ++j)
                x[i][j] = pow(10.0, i + 1) + j;
        MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
        for (i = 0; i < 4; ++i)
            printf("P:%d my x[%d][2]=%f\n", rank, i, x[i][2]);
    }
    MPI_Finalize();
    return 0;
}
The output:
P:1 my x[0][2]=17.000000
P:1 my x[1][2]=107.000000
P:1 my x[2][2]=1007.000000
P:1 my x[3][2]=10007.000000
Committing a datatype
int MPI_Type_commit (MPI_Datatype *datatype)
Obtaining Information About Derived Types
•MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type.
•MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy.
•MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.
MPI_TYPE_EXTENT
MPI_Type_extent (MPI_Datatype datatype, MPI_Aint *extent)
Correction: Deprecated. Use MPI_Type_get_extent instead!
Ref: Ian Foster’s book: “DBPP”
MPI-2
MPI-2 is a set of extensions to the MPI standard.
It was finalized by the MPI Forum in June, 1997.
MPI-2
• New Datatype Manipulation Functions
• Info Object
• New Error Handlers
• Establishing/Releasing Communications
• Extended Collective Operations
• Thread Support
• Fault Tolerance
MPI-2 Parallel I/O
• Motivation:
– The ability to parallelize I/O can offer significant performance improvements.
– User-level checkpointing is contained within the program itself.
Parallel I/O
• MPI-2 supports both blocking and nonblocking I/O
• MPI-2 supports both collective and non-collective I/O
Complementary Filetypes
Simple File Scatter/Gather - Problem
MPI-2 Parallel I/O
• Related topics that will not be covered in this course:
• MPI-2 file structure
• Initializing MPI-2 File I/O
• Defining a View
• Data Access - Reading Data
• Data Access - Writing Data
• Closing MPI-2 file I/O
How to Build a Beowulf
What is a Beowulf?
• A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.
What is a Beowulf?
A collection of personal computers interconnected by widely available networking technology, running one of several open-source Unix-like operating systems.
• COTS – Commodity-off-the-shelf components
• Interconnection networks: LAN/SAN
Price/Performance
How to Run Applications Faster
There are 3 ways to improve performance:
1. Work harder
2. Work smarter
3. Get help
Computer analogy:
1. Use faster hardware: e.g., reduce the time per instruction (clock cycle).
2. Use optimized algorithms and techniques.
3. Use multiple computers to solve the problem: that is, increase the number of instructions executed per clock cycle.
Motivation for using Clusters
• The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.
• Workstation clusters are easier to integrate into existing networks than special parallel computers.
Beowulf-class Systems
A New Paradigm for the Business of Computing
• Brings high end computing to broad ranged problems– new markets
• Order of magnitude Price-Performance advantage
• Commodity enabled– no long development lead times
• Low vulnerability to vendor-specific decisions– companies are ephemeral; Beowulfs are forever
• Rapid response technology tracking
• Just-in-place user-driven configuration– requirement responsive
• Industry-wide, non-proprietary software environment
Beowulf Project - A Brief History
• Started in late 1993
• NASA Goddard Space Flight Center– NASA JPL, Caltech, academic and industrial collaborators
• Sponsored by NASA HPCC Program
• Applications: single user science station– data intensive
– low cost
• General focus:– single user (dedicated) science and engineering applications
– system scalability
– Ethernet drivers for Linux
Beowulf System at JPL (Hyglac)
• 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.
• Connected using 100Base-T network, through a 16-way crossbar switch.
Theoretical peak performance: 3.2 GFlop/s.
Achieved sustained performance: 1.26 GFlop/s.
Cluster Computing - Research Projects (partial list)
• Beowulf (CalTech and NASA) - USA
• Condor - University of Wisconsin-Madison, USA
• HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA
• MOSIX - Hebrew University of Jerusalem, Israel
• MPI (MPI Forum; MPICH is one of the popular implementations)
• NOW (Network of Workstations) - Berkeley, USA
• NIMROD - Monash University, Australia
• NetSolve - University of Tennessee, USA
• PBS (Portable Batch System) - NASA Ames and LLNL, USA
• PVM - Oak Ridge National Lab./UTK/Emory, USA
Motivation for using Clusters
• Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.
• Performance of workstations and PCs is rapidly improving
• As performance grows, percent utilisation will decrease even further!
• Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.
Motivation for using Clusters
• The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.
• Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.
• Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!
Original Food Chain Picture

1984 Computer Food Chain
(figure: Mainframe, Vector Supercomputer, Mini Computer, Workstation, PC)

1994 Computer Food Chain
(figure: Mainframe (future is bleak), Vector Supercomputer, MPP, Workstation, PC, Mini Computer (hitting wall soon))

Computer Food Chain (Now and Future)
Parallel Computing
(figure: taxonomy - Parallel Computing splits into Tightly Coupled, Vector, MetaComputing, and Cluster Computing; Cluster Computing splits into Pile of PCs and DASHMEM-NUMA; Pile of PCs includes NOW/COW, WS Farms/cycle harvesting, Beowulf, and NT-PC Cluster; PC clusters: small, medium, large…)
Computing Elements
(figure: a multi-processor computing system shown as layers - Applications; Threads Interface; Operating System (micro-kernel); Hardware with multiple processors (P) running processes and threads)
Networking
• Topology
• Hardware
• Cost
• Performance
Cluster Building Blocks
Channel Bonding
Myrinet
Myrinet 2000 switch
Myrinet 2000 NIC
Example: 320-host Clos topology of 16-port switches
(figure: five groups of 64 hosts each; from Myricom)
Myrinet
•Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports.
•Flow control, error control, and "heartbeat" continuity monitoring on every link.
•Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications.
•Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts.
•Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.
Myrinet
• Sustained one-way data rate for large messages: 1.92 Gbit/s
• Latency for short messages: 9 μsec
Gigabit Ethernet
Switches by 3COM and Avaya
Cajun M770, Cajun P882
Cajun 550
Network Topology
Network Topology
Network Topology
Topology of the Velocity+ Cluster at CTC
Software: all this list for free!
• Compilers: FORTRAN, C/C++
• Java: JDK from Sun, IBM and others
• Scripting: Perl, Python, awk…
• Editors: vi, (x)emacs, kedit, gedit…
• Scientific writing: LaTeX, Ghostview…
• Plotting: gnuplot
• Image processing: xview, …
• …and much more!!!
Building a Parallel Cluster
• 32 top-of-the-line processors
• A fast communication network
Hardware
Dual P4 2GHz
How much does it cost?
• A dual Pentium 4 machine with 2 GB of fast RDRAM memory: $3,000
• 1 GB memory/CPU
• Operating system: Linux ($0)
How much does it cost?
• PCI64B @ 133MHz, Myrinet2000 NIC with 2M memory: $1,195
• Myrinet-2000 fiber cables, 3m long: $110
• 16-port switch with Fiber ports: $5,625
How much does it cost?
• KVM: 16port. ~$1,000
• Avocent (Cybex) using cat5 IP over Ethernet
How much does it cost?
• Computers: 16 × $3,000 = $48,000
• Network cards: 16 × ($1,195 + $110) = $20,880
• Communication switch: $5,625
• KVM: $1,000
• Monitor + miscellaneous: $500
• Total: $76,005
• Theoretical peak computing power:
• 32 × 2 = 64 GFLOPS
• $76,000 / 64 = $1,187 per GFLOPS
Less than $1.2 per MFLOPS!!!
מה עוד נדרש?
מקום!, מיזוג אויר (קירור), מערכת חשמל לגיבוי •(אל-פסק).
NFS or(נוח שאחת התחנות תשמש כשרת קבצים •other files sharing system(
.NIS בכלי כגון )users(ניהול המשתמשים • routingקישור לרשת חיצונית: אחת התחנות עושה •
פנימי לחיצוני.IPממרחב כתובות .bWatch כדוגמת Monitoringכלי •
התקנת המערכת
תחילה ניתן להתקין מחשב יחיד•
את יתר המחשבים ניתן להתקין על-ידי שיכפול •של המחשב הראשון (לדוגמא הדיסק הקשיח).Ghostע"י תכנה כגון
Installing a software package XXX (e.g., MPI)
• Download xxx.tar.gz
• Uncompress: gzip –d xxx.tar.gz
• Untar: tar xvf xxx.tar
• Prepare makefile: ./configure
• Build: make (using the generated Makefile)
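The steps above can be rehearsed end-to-end; a sketch that fabricates a throwaway archive first (xxx is the slide's own placeholder name):

```shell
# Build a stand-in for the downloaded xxx.tar.gz (hypothetical package)
mkdir -p xxx
printf 'all:\n\techo built\n' > xxx/Makefile
tar cf xxx.tar xxx
gzip xxx.tar
rm -rf xxx

# The installation steps from the slide:
gzip -d xxx.tar.gz      # uncompress
tar xvf xxx.tar         # untar
ls xxx/Makefile         # tree is unpacked; next would be ./configure && make
```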
Parallel programming requires…
• "rlogin" must be allowed (xinetd: disable=no)
• Create “.rhosts” file
• Parallel administration tools: “brsh”, “prsh” and self-made scripts.
References
• Beowulf: http://www.beowulf.org
• Computer Architecture:
http://www.cs.wisc.edu/~arch/www/
Next Week
• More topics in MPI
• Grid Computing
• Parallel computing in scientific problems
• Summary
Please start working on the projects!
The presentations begin in two weeks!