Introduction to Parallel Processing


Introduction to Parallel Processing

Lecture No. 10

24/12/2001

Homework Assignment No. 3

• May be submitted until Thursday, 27/12/2001

Final Projects

• Groups 1-10 are asked to prepare their presentations for the class in two weeks.

• Please send the presentation files in PowerPoint format before the lecture, or come to class with a burned CD-ROM.

Quizzes

• Grading of the quizzes will be finished by Friday.

• The results will be published in the next class.

Lecture Topics

• Today’s topics:
– Shared Memory
– Cilk, OpenMP
– MPI – Derived Data Types
– How to Build a Beowulf

Shared Memory

• Go to the PDF presentation:

Chapter 8 from Wilkinson & Allan’s book.

“Programming with Shared Memory”

Summary

• Process creation

• The thread concept

• Pthread routines

• How data can be created as shared

• Condition Variables

• Dependency analysis: Bernstein’s conditions

Cilk

• A language for multithreaded parallel programming based on ANSI C.

• Cilk is designed as a general-purpose parallel programming language.

• Cilk is especially effective for exploiting dynamic, highly asynchronous parallelism.

A serial C program to compute the nth Fibonacci number.

A parallel Cilk program to compute the nth Fibonacci number.

Cilk - continued

• Compiling: $ cilk -O2 fib.cilk -o fib

• Executing: $ fib --nproc 4 30

OpenMP

Next 5 slides taken from the SC99 tutorial

Given by:

Tim Mattson, Intel Corporation and

Rudolf Eigenmann, Purdue University

Further Reading

High-Performance Computing

Part III

Shared Memory Parallel Processors

Back to MPI

Collective Communication

Broadcast

Collective Communication

Reduce

Collective Communication

Gather

Collective Communication

Allgather

Collective Communication

Scatter

Collective Communication

There are more collective communication commands…

• MPI – Derived Data Types

• MPI-2 – Parallel I/O

Advanced Topics in MPI

User Defined Types

• Besides the predefined types, the user can create new types.

• Compact pack/unpack.

Predefined Types

MPI_DOUBLE double

MPI_FLOAT float

MPI_INT signed int

MPI_LONG signed long int

MPI_LONG_DOUBLE long double

MPI_LONG_LONG_INT signed long long int

MPI_SHORT signed short int

MPI_UNSIGNED unsigned int

MPI_UNSIGNED_CHAR unsigned char

MPI_UNSIGNED_LONG unsigned long int

MPI_UNSIGNED_SHORT unsigned short int

MPI_BYTE

Motivation

•What if you want to specify:

•non-contiguous data of a single type?

•contiguous data of mixed types?

•non-contiguous data of mixed types?

Derived datatypes save memory, are faster and more portable, and lead to more elegant code.

3 Steps

1. Construct the new datatype using the appropriate MPI routines: MPI_Type_contiguous, MPI_Type_vector, MPI_Type_struct, MPI_Type_indexed, MPI_Type_hvector, MPI_Type_hindexed

2. Commit the new datatype: MPI_Type_commit

3. Use the new datatype in sends/receives, etc.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Status status;
    struct { int x; int y; int z; } point;
    MPI_Datatype ptype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_contiguous(3, MPI_INT, &ptype);
    MPI_Type_commit(&ptype);

    if (rank == 3) {
        point.x = 15; point.y = 23; point.z = 6;
        MPI_Send(&point, 1, ptype, 1, 52, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&point, 1, ptype, 3, 52, MPI_COMM_WORLD, &status);
        printf("P:%d received coords are (%d,%d,%d)\n",
               rank, point.x, point.y, point.z);
    }

    MPI_Finalize();
    return 0;
}

User Defined Types

• MPI_TYPE_STRUCT
• MPI_TYPE_CONTIGUOUS
• MPI_TYPE_VECTOR
• MPI_TYPE_HVECTOR
• MPI_TYPE_INDEXED
• MPI_TYPE_HINDEXED

MPI_TYPE_STRUCT

is the most general way to construct an MPI derived type because it allows the length, location, and type of each component to be specified independently.

int MPI_Type_struct (int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)

Struct Datatype Example

count = 2

array_of_blocklengths[0] = 1

array_of_types[0] = MPI_INT

array_of_blocklengths[1] = 3

array_of_types[1] = MPI_DOUBLE

MPI_TYPE_CONTIGUOUS

is the simplest of these, describing a contiguous sequence of values in memory.

int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

For example,

MPI_Type_contiguous(2,MPI_DOUBLE,&MPI_2D_POINT);
MPI_Type_contiguous(3,MPI_DOUBLE,&MPI_3D_POINT);

MPI_TYPE_CONTIGUOUS

creates new type indicators MPI_2D_POINT and MPI_3D_POINT. These type indicators allow you to treat consecutive pairs of doubles as point coordinates in a 2-dimensional space and sequences of three doubles as point coordinates in a 3-dimensional space.

MPI_TYPE_VECTOR

describes several such sequences evenly spaced but not consecutive in memory.

MPI_TYPE_HVECTOR is similar to MPI_TYPE_VECTOR except that the distance between successive blocks is specified in bytes rather than elements.

MPI_TYPE_INDEXED describes sequences that may vary both in length and in spacing.

MPI_TYPE_VECTOR

count = 2, blocklength = 3, stride = 5

int MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)

Example program:

#include <mpi.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char *argv[]) {
    int rank, i, j;
    MPI_Status status;
    double x[4][8];
    MPI_Datatype coltype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Type_vector(4, 1, 8, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    if (rank == 3) {
        for (i = 0; i < 4; ++i)
            for (j = 0; j < 8; ++j)
                x[i][j] = pow(10.0, i + 1) + j;
        MPI_Send(&x[0][7], 1, coltype, 1, 52, MPI_COMM_WORLD);
    }
    else if (rank == 1) {
        MPI_Recv(&x[0][2], 1, coltype, 3, 52, MPI_COMM_WORLD, &status);
        for (i = 0; i < 4; ++i)
            printf("P:%d my x[%d][2]=%lf\n", rank, i, x[i][2]);
    }

    MPI_Finalize();
    return 0;
}

P:1 my x[0][2]=17.000000
P:1 my x[1][2]=107.000000
P:1 my x[2][2]=1007.000000
P:1 my x[3][2]=10007.000000

The output:

Committing a datatype

int MPI_Type_commit (MPI_Datatype *datatype)

Obtaining Information About Derived Types

•MPI_TYPE_LB and MPI_TYPE_UB can provide the lower and upper bounds of the type.

•MPI_TYPE_EXTENT can provide the extent of the type. In most cases, this is the amount of memory a value of the type will occupy.

•MPI_TYPE_SIZE can provide the size of the type in a message. If the type is scattered in memory, this may be significantly smaller than the extent of the type.

MPI_TYPE_EXTENT

MPI_Type_extent (MPI_Datatype datatype, MPI_Aint *extent)

Correction: deprecated. Use MPI_Type_get_extent instead!

Ref: Ian Foster’s book: “DBPP”

MPI-2

MPI-2 is a set of extensions to the MPI standard.

It was finalized by the MPI Forum in June, 1997.

MPI-2

• New Datatype Manipulation Functions

• Info Object

• New Error Handlers

• Establishing/Releasing Communications

• Extended Collective Operations

• Thread Support

• Fault Tolerance

MPI-2 Parallel I/O

• Motivation:
– The ability to parallelize I/O can offer significant performance improvements.

– User-level checkpointing is contained within the program itself.

Parallel I/O

• MPI-2 supports both blocking and nonblocking I/O

• MPI-2 supports both collective and non-collective I/O

Complementary Filetypes

Simple File Scatter/Gather - Problem

MPI-2 Parallel I/O

• Related topics that will not be taught in the current course:

• MPI-2 file structure
• Initializing MPI-2 File I/O
• Defining a View
• Data Access - Reading Data
• Data Access - Writing Data
• Closing MPI-2 file I/O

How to Build a Beowulf

What is a Beowulf?

• A new strategy in High-Performance Computing (HPC) that exploits mass-market technology to overcome the oppressive costs in time and money of supercomputing.

What is a Beowulf?

A Collection of personal computers interconnected by widely available networking technology running one of several open-source Unix-like operating systems.

• COTS – Commodity-off-the-shelf components

• Interconnection networks: LAN/SAN

Price/Performance

How to Run Applications Faster

There are 3 ways to improve performance:

1. Work Harder

2. Work Smarter

3. Get Help

Computer Analogy:

1. Use faster hardware: e.g. reduce the time per instruction (clock cycle).

2. Optimized algorithms and techniques.

3. Use multiple computers to solve the problem: that is, increase the number of instructions executed per clock cycle.

Motivation for using Clusters

• The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.

• Workstation clusters are easier to integrate into existing networks than special parallel computers.

Beowulf-class Systems
A New Paradigm for the Business of Computing

• Brings high end computing to broad ranged problems– new markets

• Order of magnitude Price-Performance advantage

• Commodity enabled– no long development lead times

• Low vulnerability to vendor-specific decisions– companies are ephemeral; Beowulfs are forever

• Rapid response technology tracking

• Just-in-place user-driven configuration– requirement responsive

• Industry-wide, non-proprietary software environment

Beowulf Project - A Brief History

• Started in late 1993

• NASA Goddard Space Flight Center– NASA JPL, Caltech, academic and industrial collaborators

• Sponsored by NASA HPCC Program

• Applications: single user science station– data intensive

– low cost

• General focus:– single user (dedicated) science and engineering applications

– system scalability

– Ethernet drivers for Linux

Beowulf System at JPL (Hyglac)

• 16 Pentium Pro PCs, each with 2.5 Gbyte disk, 128 Mbyte memory, Fast Ethernet card.

• Connected using 100Base-T network, through a 16-way crossbar switch.

Theoretical peak performance: 3.2 GFlop/s.

Achieved sustained performance: 1.26 GFlop/s.

Cluster Computing - Research Projects (partial list)

• Beowulf (CalTech and NASA) - USA
• Condor - Wisconsin State University, USA
• HPVM (High Performance Virtual Machine) - UIUC & now UCSB, USA
• MOSIX - Hebrew University of Jerusalem, Israel
• MPI (MPI Forum; MPICH is one of the popular implementations)
• NOW (Network of Workstations) - Berkeley, USA
• NIMROD - Monash University, Australia
• NetSolve - University of Tennessee, USA
• PBS (Portable Batch System) - NASA Ames and LLNL, USA
• PVM - Oak Ridge National Lab./UTK/Emory, USA

Motivation for using Clusters

• Surveys show utilisation of CPU cycles of desktop workstations is typically <10%.

• Performance of workstations and PCs is rapidly improving

• As performance grows, percent utilisation will decrease even further!

• Organisations are reluctant to buy large supercomputers, due to the large expense and short useful life span.

Motivation for using Clusters

• The development tools for workstations are more mature than the contrasting proprietary solutions for parallel computers - mainly due to the non-standard nature of many parallel systems.

• Workstation clusters are a cheap and readily available alternative to specialised High Performance Computing (HPC) platforms.

• Use of clusters of workstations as a distributed compute resource is very cost effective - incremental growth of system!!!

Original Food Chain Picture

1984 Computer Food Chain: Mainframe, Vector Supercomputer, Mini Computer, Workstation, PC

1994 Computer Food Chain: Mainframe (future is bleak), Vector Supercomputer, MPP, Workstation, PC, Mini Computer (hitting wall soon)

Computer Food Chain (Now and Future)

Parallel Computing: Tightly Coupled, Vector, Cluster Computing, MetaComputing

Pile of PCs: NOW/COW, WS Farms/cycle harvesting, Beowulf, NT-PC Cluster, DASHMEM-NUMA

PC Clusters: small, medium, large…

Computing Elements

P PP P P PMicro kernelMicro kernel

Multi-Processor Computing System

Threads InterfaceThreads Interface

Hardware

Operating System

ProcessProcessor ThreadPP

Applications

Networking

• Topology

• Hardware

• Cost

• Performance

Cluster Building Blocks

Channel Bonding

Myrinet

Myrinet 2000 switch

Myrinet 2000 NIC

Example: 320-host Clos topology of 16-port switches

[Diagram: five groups of 64 hosts each, connected through a Clos network of 16-port switches. (From Myricom)]

Myrinet

•Full-duplex 2+2 Gigabit/second data rate links, switch ports, and interface ports.

•Flow control, error control, and "heartbeat" continuity monitoring on every link.

•Low-latency, cut-through, crossbar switches, with monitoring for high-availability applications.

•Switch networks that can scale to tens of thousands of hosts, and that can also provide alternative communication paths between hosts.

•Host interfaces that execute a control program to interact directly with host processes ("OS bypass") for low-latency communication, and directly with the network to send, receive, and buffer packets.

Myrinet

• Sustained one-way data rate for large messages: 1.92 Gbit/s

• Latency for short messages: 9 µs

Gigabit Ethernet

Switches by 3COM and Avaya

Cajun M770, Cajun P882, Cajun 550

Network Topology


Topology of the Velocity+ Cluster at CTC

Software: all this list for free!

• Compilers: FORTRAN, C/C++• Java: JDK from Sun, IBM and others• Scripting: Perl, Python, awk…• Editors: vi, (x)emacs, kedit, gedit…• Scientific writing: LaTex, Ghostview…• Plotting: gnuplot• Image processing: xview, • …and much more!!!

Building a Parallel Cluster

• 32 top-of-the-line processors

• A fast communication network

Hardware

Dual P4 2GHz

How much does it cost us?

• Dual Pentium 4 computer with 2 GB of fast RDRAM memory: $3,000

(1 GB memory/CPU)

• Operating system: Linux ($0)

כמה זה עולה לנו?

• PCI64B @ 133MHz, Myrinet2000 NIC with 2M memory: $1,195

• Myrinet-2000 fiber cables, 3m long: $110

• 16-port switch with Fiber ports: $5,625

How much does it cost us?

• KVM: 16port. ~$1,000

• Avocent (Cybex) using cat5 IP over Ethernet

How much does it cost us?

• Computers: 16 × $3,000 = $48,000

• Network cards: 16 × ($1,195 + $110) = $20,880

• Switch: $5,625

• KVM: $1,000

• Monitor + miscellaneous: $500

• Total: $76,005

• Theoretical peak computing power:

• 2 × 32 = 64 GFLOPS

• $76,000 / 64 = $1,187/GFLOP

Less than 1.2 $/MFLOP!!!

What else is needed?

• Space!, air conditioning (cooling), a backup power system (UPS).

• It is convenient for one of the stations to serve as a file server (NFS or another file-sharing system).

• User management with tools such as NIS.
• Connection to an external network: one of the stations performs routing from the internal IP address space to the external one.
• Monitoring tools such as bWatch.

Installing the System

• First, a single computer can be installed.

• The remaining computers can then be installed by cloning the first one (e.g., its hard disk) with software such as Ghost.

Installing a software package xxx (e.g., MPI)

• Download xxx.tar.gz

• Uncompress: gzip –d xxx.tar.gz

• Untar: tar xvf xxx.tar

• Prepare makefile: ./configure

• Make (Makefile)

Parallel programming requires…

• “rlogin” must be allowed (xinetd: disable=no)

• Create “.rhosts” file

• Parallel administration tools: “brsh”, “prsh” and self-made scripts.

Next Week

• Additional topics in MPI

• Grid Computing

• Parallel computing in scientific problems

• Summary

Please start working on the projects!

The presentations begin in two weeks!