
Page 1

e-mail pawelw@man.poznan.pl
http://www.man.poznan.pl/

Page 2

POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER

Homogeneous and heterogeneous environments

• Homogeneous environment:
  • uniform
  • components share the same values and characteristics
  • scalable

• Heterogeneous environment:
  • diverse components
  • a varied set of parameters and characteristics
  • scalable
  • difficult to manage
  • different operating systems
  • different architectures
  • different vendors

Page 3

POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER

Resources

• processor (CPU, type)
  • clock frequency (different CPU boards)
  • type, e.g. scalar, vector, graphics
• RAM (type, size)
• I/O
  • network interfaces
  • disks
  • 'graphics engines'
• mass storage
• individual systems (nodes in a network)
  • specialized systems (compute, graphics, archiving, etc.)

Page 4

POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER

Resource requirements 1/2

Compute, Visualize, Data:

• BIG Compute Problems: computing, visualization, data handling
• BIG Visualization Problems: computing, visualization, data handling
• BIG Data Problems: computing, visualization, data handling

Page 5

POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER

Resource requirements 2/2

Scale in any and all dimensions (I/O, CPU, Storage):

• Web serving
• Weather simulation
• Repository / archive
• Signal processing
• Media streaming
• Traditional big supercomputer

Page 6

Page 7

Cluster types

- High-availability clusters, whose task is to keep the system running continuously and to shift the load onto spare nodes in the event of a failure (e.g. WWW servers, e-commerce).

- Capability (compute) clusters, whose task is the parallel processing of applications for scientific, engineering or design purposes. Efficient inter-node communication mechanisms are required so that a high degree of parallelism (fine-grain granularity) can be exploited. Compute clusters are usually dedicated to a specific application, and programs are executed sequentially and do not compete with one another for resources.

- Scalability clusters, whose task is to improve the efficiency of program execution by assigning nodes to applications appropriately. Management software is required that provides job launching, load balancing, load analysis and job management. Distributed jobs, if any, can exploit parallelism at the level of procedures and modules.

Page 8

Single system image

Single Point of Entry: A user can connect to the cluster as a single system (e.g. telnet beowulf.myinstitute.edu), instead of connecting to individual nodes as in the case of distributed systems (e.g. telnet node1.beowulf.myinstitute.edu).

Single File Hierarchy (SFH): On entering the system, the user sees the file system as a single hierarchy of files and directories under the same root directory. Examples: xFS and Solaris MC Proxy.

Single Point of Management and Control: The entire cluster can be monitored or controlled from a single window using a single GUI tool, much like an NT workstation managed by the Task Manager tool or PARMON monitoring the cluster resources.

Single Virtual Networking: Any node can access any network connection throughout the cluster domain, even if the network is not physically connected to all nodes in the cluster.

Single Memory Space: This gives the illusion of a single shared memory built over the memories associated with the nodes of the cluster.

Single Job Management System: A user can submit a job from any node using a transparent job submission mechanism. Jobs can be scheduled to run in batch, interactive, or parallel mode (discussed later). Example systems include LSF and CODINE.

Single User Interface: The user should be able to use the cluster through a single GUI. The interface must have the same look and feel as an interface available for workstations (e.g., Solaris OpenWin or the Windows NT GUI).

Page 9

Single system image

Availability Support Functions

Single I/O Space (SIOS): This allows any node to perform I/O operations on locally or remotely located peripheral or disk devices. In this SIOS design, the disks associated with cluster nodes, RAIDs, and peripheral devices form a single address space.

Single Process Space: Processes have a unique cluster-wide process id. A process on any node can create child processes on the same or a different node (through a UNIX fork) or communicate with any other process (through signals and pipes) on a remote node. The cluster should support globalized process management and allow processes to be managed and controlled as if they were running on local machines.

Checkpointing and Process Migration: Checkpointing mechanisms allow a process state and intermediate computing results to be saved periodically. When a node fails, processes on the failed node can be restarted on another working node.
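As a rough illustration of the checkpointing idea above (not the actual SSI implementation on any of the systems discussed here), the C sketch below periodically saves a small computation state to a file so the job can be restarted elsewhere. The file name and state layout are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical computation state; a real checkpointer would capture
     * the full process image or rely on library/OS support. */
    typedef struct { long iteration; double partial_result; } state_t;

    static const char *CKPT_FILE = "app.ckpt";     /* assumed file name */

    static void save_checkpoint(const state_t *s) {
        FILE *f = fopen(CKPT_FILE, "wb");
        if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
    }

    static int load_checkpoint(state_t *s) {
        FILE *f = fopen(CKPT_FILE, "rb");
        if (!f) return 0;                          /* no checkpoint: start fresh */
        int ok = fread(s, sizeof *s, 1, f) == 1;
        fclose(f);
        return ok;
    }

    int main(void) {
        state_t s = {0, 0.0};
        if (load_checkpoint(&s))                   /* resume after migration or failure */
            printf("resuming at iteration %ld\n", s.iteration);

        for (; s.iteration < 1000000; s.iteration++) {
            s.partial_result += 1.0 / (s.iteration + 1);   /* the "work" */
            if (s.iteration % 100000 == 0)
                save_checkpoint(&s);               /* periodic checkpoint */
        }
        printf("result: %f\n", s.partial_result);
        return 0;
    }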

Page 10

Degree of complexity

• C-brick: CPU Module
• D-brick: Disk Storage
• R-brick: Router Interconnect
• X-brick: XIO Expansion
• P-brick: PCI Expansion
• I-brick: Base I/O Module
• G-brick: Graphics Expansion

Page 11

POZNAŃ SUPERCOMPUTING AND NETWORKING CENTER

Homogeneous clusters

• GigaRing, SuperCluster (T3E)
• PowerChallengeArray
• POE
• DCE
• Management of large volumes of data
• Archiving systems

Page 12

Massively Parallel Processing (MPP)

• Massively parallel approaches achieve high processing rates by assembling large numbers of relatively slow processors
• Traditional approaches focus on improving the speed of individual processors and assemble only a few of these powerful processors into a complete machine
• Improving network speed and communication overheads
• Examples:
  – Thinking Machines (CM-2, CM-5)
  – Intel Paragon
  – Kendall Square (KS-1)
  – SGI Origin 2000
  – Cray T3D, T3E

Poznań Supercomputing and Networking Center

Page 13

Poznań Supercomputing and Networking Center

MPP network topologies

Some commonly used network topologies:

Topology                 Connectivity
Ring                     2
2-Dimensional Mesh       4
3-Dimensional Mesh       6
Hypercube (2^N nodes)    N  (figure shows N = 3)
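To make the connectivity column concrete, here is a small C sketch (illustrative, not from the original slides) that enumerates the neighbours of a node in an N-dimensional hypercube: each node has exactly N neighbours, obtained by flipping one bit of its address.

    #include <stdio.h>

    /* Print the N neighbours of `node` in an N-dimensional hypercube
     * (2^N nodes, ids 0 .. 2^N - 1). Flipping bit i of the id gives
     * the neighbour along dimension i. */
    static void hypercube_neighbours(unsigned node, unsigned n_dims) {
        for (unsigned i = 0; i < n_dims; i++)
            printf("node %u <-> node %u (dimension %u)\n",
                   node, node ^ (1u << i), i);
    }

    int main(void) {
        hypercube_neighbours(5, 3);   /* N = 3: an 8-node hypercube, as on the slide */
        return 0;
    }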

Page 14

Poznań Supercomputing and Networking Center

Cray T3E, T3D

• The Cray MPP system contains four types of components: processing element nodes, the interconnect network, I/O gateways and a clock
• Network topology: 3D mesh

Cray T3D system components (figure): processing element nodes connected by the interconnect network, with links between neighbouring nodes in the +X/-X, +Y/-Y and +Z/-Z directions, plus I/O gateways.

Page 15

Poznań Supercomputing and Networking Center

Cray T3E

Processing Element Nodes (PE)

• Each PE contains a microprocessor, local memory and support circuitry
• 64-bit DEC Alpha RISC processor
• Very high scalability (8 ... 2048 CPUs)

Node structure (figure): each node (Node A, Node B) couples a CPU and memory to a switch, which attaches to the interconnect links.

Page 16

Poznań Supercomputing and Networking Center

Cray T3E

Interconnect Network

• The interconnect network provides communication paths between PEs
• The paths form a three-dimensional matrix that connects the nodes in the X, Y and Z dimensions
• A communication link transfers data and control information between two network routers and connects two nodes in one dimension. A communication link is actually two unidirectional channels; each channel in the link contains data, control and acknowledge signals.
• Dimension-order routing (a predefined method of routing information through the network; see the sketch below)
• Fault tolerance
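Dimension-order routing is only named on the slide; the C sketch below is an illustrative approximation (not Cray's implementation) of the basic idea on a 3D mesh: a packet first corrects its X offset, then Y, then Z.

    #include <stdio.h>

    typedef struct { int x, y, z; } coord_t;

    /* Dimension-order (X, then Y, then Z) routing on a 3D mesh:
     * move one hop at a time until each coordinate matches the target. */
    static void route(coord_t cur, coord_t dst) {
        while (cur.x != dst.x || cur.y != dst.y || cur.z != dst.z) {
            if      (cur.x != dst.x) cur.x += (dst.x > cur.x) ? 1 : -1;
            else if (cur.y != dst.y) cur.y += (dst.y > cur.y) ? 1 : -1;
            else                     cur.z += (dst.z > cur.z) ? 1 : -1;
            printf("hop to (%d,%d,%d)\n", cur.x, cur.y, cur.z);
        }
    }

    int main(void) {
        coord_t src = {0, 0, 0}, dst = {2, 1, 3};
        route(src, dst);   /* 2 hops in X, then 1 in Y, then 3 in Z */
        return 0;
    }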

Page 17

Poznań Supercomputing and Networking Center

Cray T3E

Distributed operating system (Unicos/mk)

In the Cray T3E systems, the local memory of each PE must contain a copy of the microkernel and one or more servers. Under Unicos/mk each PE is configured as one of the following types of PEs:

• Support PEs
The local memory of support PEs contains a copy of the microkernel and servers. The exact number and type of servers vary depending on configuration tuning.

• User PEs
The local memory of user PEs contains a copy of the microkernel and a minimum number of servers. Because it contains a limited amount of operating system code, most of a user PE's local memory is available to the user. User PEs include command and application PEs.

• Redundant PEs
A redundant PE is not configured into the system until an active PE fails.

Page 18

Poznań Supercomputing and Networking Center

Cray T3E

Distributed operating system (Unicos/mk microkernel)

• Unicos/mk does not require a common memory architecture. Unlike Unicos, the functions of Unicos/mk are divided between a microkernel and numerous servers. For this reason, Unicos/mk is referred to as a serverized operating system.
• Serverized operating systems offer a distinct advantage for the Cray T3E system because of its distributed memory architecture. Within these systems, the local memory of each PE is not required to hold the entire set of OS code.
• The operating system can be distributed across the PEs in the whole system.
• Under Unicos/mk, traditional UNICOS processes are implemented as actors. An actor represents a resource allocation entity. The microkernel views all user processes, servers and daemons as actors.
• A multiple-PE application has one actor per PE. User and daemon actors reside in user address space; server actors reside in supervisory (kernel address) space.

Page 19

Page 20

T3EMS – PE configuration

Page 21

T3E – job scheduling

Page 22

psched daemon modules

Gang scheduler
Provides application CPU and memory residency control by enabling you to schedule all members of an application together. This guarantees that the application members are synchronized across all PEs spanning the application.

Load balancer
Measures how well processes and applications are acted upon and serviced in each scheduling domain. Based on this information, the load balancer may decide to move commands and applications among eligible PEs in each domain.

MUSE
Implements a scheduling strategy similar to the fair-share scheduler in UNICOS. MUSE allows the system to be shared among groups in an organized way by assigning resources to the most deserving process.

Resource manager
Collects and analyses information about resource usage within the machine for internal and external use. The object manager then makes this information available in a uniform way to service providers such as NQE.

Page 23

Gang scheduling

All processes of an application are assigned to resources at the same time.

Parameters:
• Heartbeat – the length of the time quantum allocated to an application
• Partial – allows partial scheduling when free resources are available
• Variation – variation of the time quantum
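A gang scheduler can be pictured as a round-robin over whole applications rather than over individual processes. The toy C sketch below is purely illustrative (it is not psched code, and the application names and quantum are made up): every gang gets all of its PEs simultaneously for one heartbeat, then the next gang runs.

    #include <stdio.h>

    /* Toy gang scheduler: each "gang" (parallel application) receives all of
     * its PEs at once for one heartbeat, then yields to the next gang. */
    typedef struct { const char *name; int pes; } gang_t;

    int main(void) {
        gang_t gangs[] = { {"app A", 3}, {"app B", 4}, {"app C", 2} };
        int n = (int)(sizeof gangs / sizeof gangs[0]);
        int heartbeat = 1;                         /* time quantum, arbitrary units */

        for (int slice = 0; slice < 6; slice++) {  /* a few scheduling rounds */
            gang_t *g = &gangs[slice % n];
            printf("t=%d: gang %-6s runs on %d PEs together\n",
                   slice * heartbeat, g->name, g->pes);
        }
        return 0;
    }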

Page 24

Load balancing in the interactive domain

Processes are moved between processors depending on the resources they use.

The cost of moving jobs is taken into account.

Page 25

Load balancing in the application domain

• Minimize swapping
• Minimize migration cost
• Perform expensive migrations only when necessary
• Minimize the number of parties
• Maximize the number of contiguously allocated PEs per party

Parameters:

Heartbeat – frequency
MigrationDelay – the minimum time between migrations of the same application
MigrationGravity – the direction in which applications are moved (down, up, or both)
NoPreemptiveMigration – migrate only if further applications are waiting to run

Page 26

MUSE scheduler

Allocates a fixed percentage of CPU time regardless of the number of processes a user runs.

Page 27

psview – MUSE

lotus 9% psview -m APP
Status of MUSE Domain: APP
  PE Range    : 0 - 0x4
  Mode        : Active
  Share by    : UID
  Heartbeat   : 600 seconds
  Decay       : 3600 seconds
  OsHeartbeat : 60 seconds

             Entitlement          MUSE     LongTerm Interval
 Name       Absolute Relative   Factor    Usage    Usage    Type
 ---------- -------- -------- -------- -------- -------- --------
 root              1   1.0000        -        -        -  Root
 Users            100   0.5000       -   0.9630        -  Group
  komasa          100   0.5000  0.2596   0.9630   0.6824  Active
 Staff            100   0.5000       -   0.0370        -  Group
  pawelw          100   0.5000  1.0000   0.0370   0.3176  Active
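The Relative column above is consistent with normalizing each account's Absolute share against its peers (Users and Staff, with 100 shares each, both get 0.5000). The C sketch below shows that normalization only; it is purely illustrative and not MUSE's actual scheduling algorithm.

    #include <stdio.h>

    /* Turn absolute shares into fractions that sum to 1.0 among siblings,
     * in the spirit of the psview -m "Relative" column. */
    static void relative_entitlements(const double *absolute, double *relative, int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++) total += absolute[i];
        for (int i = 0; i < n; i++) relative[i] = absolute[i] / total;
    }

    int main(void) {
        double abs_shares[] = {100.0, 100.0};    /* e.g. Users and Staff */
        double rel[2];
        relative_entitlements(abs_shares, rel, 2);
        printf("%.4f %.4f\n", rel[0], rel[1]);   /* prints 0.5000 0.5000 */
        return 0;
    }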

Page 28

psview -gang

lotus 12% psview -g APP
Status of Gang Scheduler Domain: APP
  PE Range   : 0 - 0x4
  Mode       : Full Gang Scheduling
  Gangs      : 3
  Parties    : 2
  Time Slice : 50 - 800; Current: 800; Minimum: 5
  Status     : schedule change pending

 Rank Command Name     User     PE-Range    Id     Status
 ==== ================ ======== =========== ====== =======
    0 a.out            pawelw   0x003-0x004 19415  -
      a.out            pawelw   000-0x002   19087  -
    1 nel186_4.exe     komasa   000-0x003   81257  swapped (1 of 4)

Page 29

Poznań Supercomputing and Networking Center

GigaRing Channel

• The GigaRing channel architecture is a modification of the Scalable Coherent Interface (SCI) specification and is designed to be the common channel that carries information between Input/Output Nodes (IONs)
• The channel consists of a pair of 500 MB/s channels configured as counter-rotating rings
• The two rings form a single logical channel with a maximum bandwidth of 1.0 GB/s. Protocol overhead lowers the channel rate to 920 MB/s.
• A client connects to the GigaRing channel through an ION via a 64-bit full-duplex interface
• Detection of lost packets and cyclic redundancy checksums

Page 30

Poznań Supercomputing and Networking Center

GigaRing Channel

The counter-rotating rings provide two forms of system resiliency:
• Ring folding
• Ring masking

GigaRing node interface (figure): a client-specific chip connects to the GigaRing node chip over a 64-bit client port; the node chip has positive in/out and negative in/out links onto the two rings.

Page 31

Poznań Supercomputing and Networking Center

GigaRing Channel

Ring Folding

• The GigaRing channel can be configured in software to map one or more IONs out of the system. Ring folding converts the counter-rotating rings into a single ring
• The maximum channel bandwidth for a folded ring is approximately 500 MB/s

Page 32

Poznań Supercomputing and Networking Center

GigaRing Channel

Ring Masking

• Ring masking removes one of the counter-rotating rings from the system, which results in one fully connected, unidirectional ring
• The maximum channel bandwidth is 500 MB/s

Page 33

Poznań Supercomputing and Networking Center

GigaRing Channel

Input/Output Nodes (ION)

• All devices that connect directly to the GigaRing channel are considered to be IONs
• There are three types of IONs:
  Single-purpose node (SPN)
  Multipurpose node (MPN)
  Mainframe node
• Available mainframe nodes: Cray T3E, Cray J90se, Cray T90

Page 34

Poznań Supercomputing and Networking Center

GigaRing Channel

Example configuration (figure): Cray T3E, Cray T90, Cray J90se and Cray J90 mainframe nodes, an HPN-2 (HIPPI) node attached to a HIPPI network, and a disk array, all connected to the GigaRing channel.

Page 35

Poznań Supercomputing and Networking Center

SuperCluster Environment

Overview (figure): parallel vector supercomputers (J90, Cray T90) and the Cray T3E are connected through a HIPPI switch to a HIPPI disk array, and via Ethernet, FDDI and ATM to heterogeneous workstations and servers. The cluster software includes PVM, NQE, NFS, DFS and DCE.

Page 36: e-mail pawelw@man.poznan.pl             man.poznan.pl

• Job distribution and load balancing

Cray NQX (NQE for Unicos)

• Open systems remote file access:

NFS

• Standard, secured distributed file system:

DCE DFS Server

• Client/server based distributed computing:

DCE Client Services

• Cray Message Passing Toolkit (MPT):

PVM, MPI

• High performance, resilient file sharing: opt.

Shared File System (SFS)

• Client/server hierarchical storage management: opt.

Data Migration Facility (DMF)

SuperCluster Software ComponentsSuperCluster Software ComponentsPoznań Supercomputing and Networking CenterPoznań Supercomputing and Networking Center

Page 37

Poznań Supercomputing and Networking Center

SuperCluster Software Components

Network Queuing Environment (NQE)

• NQE consists of four components: the Network Queuing System (NQS), the Network Load Balancer (NLB), the File Transfer Agent (FTA), and the NQE clients
• NQE is a batch queuing system that automatically load balances jobs across heterogeneous systems on a network. It runs each job submitted to the network as efficiently as possible on the resources available.
• This provides faster turnaround for users and automatic load balancing to ensure that all systems on the network are used effectively.

Architecture (figure): NQE clients submit work to an NQE master server (NLB server, NQS, FTA, collector), which dispatches it to NQE execution servers (NQS, FTA, collector).

Page 38

Poznań Supercomputing and Networking Center

POWER CHALLENGEarray

• Consists of up to eight Power Challenge or Power Onyx (POWERnode) supercomputing systems connected by a high-performance HIPPI interconnect
• Two-level communication hierarchy: CPUs within a POWERnode communicate via a fast shared-bus interconnect, and CPUs across POWERnodes communicate via the HIPPI interconnect

Structure (figure): each POWERnode couples its processors (P) to shared memory (M); the POWERnodes are linked by a HIPPI switch.

Page 39

Poznań Supercomputing and Networking Center

POWER CHALLENGEarray

Parallel programming models supported:

• Shared memory with n processes inside a POWERnode
• Message passing with n processes inside a POWERnode
• Hybrid model with n processes inside a POWERnode, using a combination of shared memory and message passing
• Message passing with n processes over p POWERnodes (see the sketch below)
• Hybrid model with n processes over p POWERnodes, using a combination of shared memory within a POWERnode and message passing between POWERnodes
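To ground the message-passing model listed above, here is a minimal MPI example in C. It is a generic illustration assuming any MPI implementation (such as the MPI library shipped with the array), not code taken from the slides: rank 0 sends one value to rank 1.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal message passing: rank 0 sends one integer to rank 1.
     * On a POWER CHALLENGEarray the two ranks may sit in the same
     * POWERnode (shared-bus transport) or in different POWERnodes
     * (HIPPI transport); the MPI code is the same either way. */
    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }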

Page 40

Poznań Supercomputing and Networking Center

Message passing: MPI model

Figure: MPI tasks inside a POWERnode communicate through shared memory (multiparallel memory sharing); MPI tasks in different POWERnodes communicate via sockets.

Page 41

Poznań Supercomputing and Networking Center

POWER CHALLENGEarray

Software:

• Native POWERnode tools: IRIX 6.x, XFS, NFS, MIPSpro compilers, scientific and math libraries, development environment
• Array services: allow the array to be managed and administered as a single system
• Distributed program development tools: HPF, MPI and PVM libraries, tools for distributed program visualization and debugging (Upshot, XPVM)
• Distributed batch processing tools: LSF, CODINE
• Distributed system management tools: IRIXPro, Performance Co-Pilot (PCP)

Page 42

Poznań Supercomputing and Networking Center

An array session is a set of processes, possibly running across several POWERnodes, that are related to one another by a single, unique identifier called the Array Session Handle (ASH). A local ASH is assigned by the kernel and is guaranteed to be unique within a single POWERnode, whereas a global ASH is assigned by the array services daemon and is unique across the entire POWER CHALLENGEarray.

Figure: each POWERnode (POWERnode1 ... POWERnode4) in the array runs an array services daemon; an array session groups processes (Process 1, Process 2, Process 3) that may be spread over several POWERnodes.

Page 43

Poznańskie Centrum Superkomputerowo-Sieciowe

Parallel Operating Environment

• Parallel Operating Environment – an environment for parallel work
• Simplifies launching parallel programs
• A single point of management – a console shared by all processes
• Simple configuration via environment variables (or parameters)
• MPL, MPI, custom parallel programs or even serial ones

Page 44

Parallel Operating Environment

POE consists of parallel compiler scripts, POE environment variables, parallel debugger(s) and profiler(s), MPL, and parallel visualization tools. These tools allow one to develop, execute, profile, debug, and fine-tune parallel code.

The Partition Manager controls a partition, or group of nodes, on which you wish to run your program. The Partition Manager requests the nodes for your parallel job, acquires the nodes necessary for that job (if the Resource Manager is not used), copies the executables from the initiating node to each node in the partition, loads the executables on every node in the partition, and sets up standard I/O.

The Resource Manager keeps track of the nodes currently processing a parallel task and, when nodes are requested by the Partition Manager, it allocates nodes for use. The Resource Manager attempts to enforce a "one parallel task per node" rule.

The Processor Pools are sets of nodes dedicated to a particular type of processing (such as interactive, batch, or I/O intensive) which have been grouped together by the system administrator(s).

Page 45

What is POE?

POE encompasses a collection of software tools designed to provide an environment for developing, executing, debugging and profiling parallel C, C++ and Fortran programs.

• Facilities to manage your parallel execution environment (environment variables and command line flags)
• Message Passing Interface (MPI) library for interprocess communications
• Subset of MPI-2
• Low-level Application Programming Interface (LAPI)
• Parallel compiler scripts
• Parallel file copy utilities
• Authentication utilities
• Parallel debuggers
• Parallel profiling tools
• Dynamic probe class library (DPCL) parallel tools development API

Page 46

What is POE?

Much of what POE does is designed to be transparent to the parallel user. Some of these tasks include:

• Linking the necessary parallel libraries during compilation (via parallel compiler scripts)
• Finding and acquiring machines (nodes) for your parallel job
• Loading your executable onto all nodes acquired for your parallel job
• Handling all stdin, stderr and stdout between the nodes of your parallel job
• Signal handling for all tasks in your job
• Providing intertask communication facilities
• Managing the use of processor and network adapter resources
• Retrieving system and job status information when requested
• Error detection and reporting
• Providing support for run-time profiling and analysis tools

Page 47

Basic POE Environment Variables

MP_PROCS
The number of task processes for your parallel job. May be used alone or in conjunction with MP_NODES and/or MP_TASKS_PER_NODE to specify how many tasks are loaded onto a physical SP node. The maximum value of MP_PROCS depends on the version of the PE software installed (currently ranging from 128 to 2048). If not set, the default is 1.

MP_NODES
Specifies the number of physical nodes on which to run the parallel tasks. May be used alone or in conjunction with MP_TASKS_PER_NODE and/or MP_PROCS.

MP_TASKS_PER_NODE
Specifies the number of tasks to be run on each of the physical nodes. May be used in conjunction with MP_NODES and/or MP_PROCS.

MP_RESD
Specifies whether or not LoadLeveler should be used to allocate nodes. Valid values are either "yes" (non-specific node allocation) or "no" (specific node allocation). If not set, the default value is context sensitive to other POE variables. Batch systems typically override/ignore user settings for this environment variable.

Page 48

Basic POE Environment Variables

MP_RMPOOL
Specifies the SP system pool number that should be used for non-specific node allocation. This is only valid if you are using LoadLeveler for non-specific node allocation (from a single pool) without a host list file. Batch systems typically override/ignore user settings for this environment variable.

MP_HOSTFILE
Used only if you wish to explicitly select which nodes will be allocated for your POE job (specific node allocation). If you prefer to let LoadLeveler allocate nodes automatically, this variable should be set to NULL or "". If used, this variable specifies the name of a file which contains the actual machine (domain) names of the nodes you wish to use. It can also be used to specify which pools should be used. The default filename is "host.list" in the current directory.

MP_EUILIB
Specifies which of two protocols should be used for task communications. Valid values are either "ip" for Internet Protocol or "us" for User Space protocol. The default is "ip", while "us" is faster.

MP_EUIDEVICE
A node may be physically connected to different networks. This environment variable specifies which network adapter should be used for communications. Valid values are: "en0" (Ethernet), "fi0" (FDDI), "tr0" (token ring), or "css0" (high-performance switch). Note that valid values also depend on the actual physical network configuration of the node.
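The variables above are read by POE itself; a user program does not need to touch them. As a small, hedged illustration, the C sketch below merely echoes the variables described on these two slides (falling back to the documented defaults where the slides state one), which can be handy for checking what a job would be launched with. It is not part of POE.

    #include <stdio.h>
    #include <stdlib.h>

    /* Return the environment variable's value, or a fallback if unset. */
    static const char *get_or(const char *name, const char *dflt) {
        const char *v = getenv(name);
        return v ? v : dflt;
    }

    int main(void) {
        printf("MP_PROCS          = %s\n", get_or("MP_PROCS", "1"));         /* default 1 */
        printf("MP_NODES          = %s\n", get_or("MP_NODES", "(unset)"));
        printf("MP_TASKS_PER_NODE = %s\n", get_or("MP_TASKS_PER_NODE", "(unset)"));
        printf("MP_EUILIB         = %s\n", get_or("MP_EUILIB", "ip"));        /* "ip" or "us" */
        printf("MP_HOSTFILE       = %s\n", get_or("MP_HOSTFILE", "host.list"));
        return 0;
    }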

Page 49

System Status Array

The leftmost area represents a list of POE jobs which the Resource Manager knows about. Clicking on one of these jobs selects it.

The rightmost area provides a list of node names; nodes are listed in order from left to right and from top to bottom.

The central area provides a grid of squares, each square representing a machine/node:
• Pink squares represent nodes with low utilization.
• Yellow squares represent nodes with high utilization.
• Gray squares are nonexistent nodes or nodes that are not available for monitoring.
• Squares with green boxes indicate which nodes are associated with the selected POE job number.

Page 50

DCE

1. DCE provides tools and services that support distributed applications (DCE RPC, DCE Threads, DCE Directory Service, Security Service and Distributed Time Service).
2. DCE's set of services is integrated and comprehensive.
3. DCE provides interoperability and portability across heterogeneous platforms.
4. DCE supports data sharing.
5. DCE participates in a global computing environment (X.500 and the Domain Name Service (DNS)).

Page 51

Potential Users of DCE

1. An office with isolated computing resources can network the computers together and use DCE for data and resource sharing.
2. An organization consisting of multiple computing sites that are already interconnected by a network can use DCE to tie together and access resources across the different sites.
3. Any computing organization comprising, or expecting to comprise in the future, more cooperating hosts than can easily be administered manually.
4. Organizations that write distributed applications can use DCE as a platform for their software. Applications that are written on DCE can be readily ported to other software and hardware platforms that also support DCE.
5. Organizations wishing to use applications that run on DCE platforms.
6. Organizations that wish to participate in networked computing on a global basis.
7. System vendors whose customers are in any of the preceding categories.
8. Organizations that would like to make a service available over the network on one system (for example, a system running a non-UNIX operating system), and have it accessible from other kinds of systems (for example, workstations running UNIX).

Page 52

DCE Models of Distributed Computing

• The Client/Server Model

• The Remote Procedure Call Model

• The Data Sharing Model

• The Distributed Object Model

Page 53

DCE architecture

Page 54

DCE architecture (figure): DCE clients communicate over the network with a CDS server, a Security server, a DFS file server (UFS/LFS), a GDS server with a GDA agent (linked to DNS and X.500), and a DTS server with an external time provider; each server exposes the DCE API.

Page 55

Architectural Overview of DCE

• DCE Threads supports the creation, management, and synchronization of multiple threads of control within a single process. This component is conceptually part of the operating system layer, the layer below DCE. DCE Threads is used by other DCE components and is also available for applications to use.

• The DCE Remote Procedure Call facility consists of both a development tool and a runtime service. The development tool consists of a language (and its compiler) that supports the development of distributed applications following the client/server model. It automatically generates code that transforms procedure calls into network messages. The runtime service implements the network protocols by which the client and server sides of an application communicate. DCE RPC also includes software for generating unique identifiers, which are useful in identifying service interfaces and other resources.
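DCE Threads is closely related to the POSIX threads interface. The C sketch below uses the modern POSIX pthread API (not the DCE Threads API itself, which differs in details) simply to illustrate the create/join threading model described above.

    #include <pthread.h>
    #include <stdio.h>

    /* Illustrative only: POSIX threads, showing the same create/join model
     * as the multithreading support described for DCE Threads. */
    static void *worker(void *arg) {
        printf("thread %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);   /* spawn threads */
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);                         /* synchronize */
        return 0;
    }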

Page 56

Architectural Overview of DCE

• The DCE Directory Service is a central repository for information about resources in the distributed system. Typical resources are users, machines, and RPC-based services. The information consists of the name of the resource and its associated attributes. Typical attributes could include a user's home directory or the location of an RPC-based server.

• The DCE Directory Service comprises several parts: the Cell Directory Service (CDS), the Global Directory Service (GDS), the Global Directory Agent (GDA), and a directory service programming interface. CDS manages a database of information about the resources in a group of machines called a DCE cell. (Cells are described in the next section.) GDS implements an international standard directory service and provides a global namespace that connects the local DCE cells into one worldwide hierarchy. GDA acts as a go-between for cell and global directory services. Both CDS and GDS are accessed using a single directory service application programming interface, the X/Open Directory Service (XDS) API.

Page 57

Architectural Overview of DCE

• The DCE Distributed Time Service (DTS) provides synchronized time on the computers participating in a Distributed Computing Environment. DTS synchronizes a DCE host's time with Coordinated Universal Time (UTC), an international time standard.

• The DCE Security Service provides secure communications and controlled access to resources in the distributed system. There are four aspects to DCE security: authentication, secure communications, authorization, and auditing. These aspects are implemented by several services and facilities that together constitute the DCE Security Service, including the registry service, the authentication service, the privilege service, the access control list (ACL) facility, the login facility, and the audit service.

Page 58

Architectural Overview of DCE

• The DCE Distributed File Service allows users to access and share files stored on a file server anywhere on the network, without having to know the physical location of the file. Files are part of a single, global namespace, so no matter where in the network a user is, the file can be found by using the same name. DFS achieves high performance, particularly through caching of file system data, so that many users can access files located on a given file server without prohibitive amounts of network traffic and the resulting delays.

• DCE DFS includes a physical file system, the DCE Local File System (LFS), which supports special features that are useful in a distributed environment. They include the ability to replicate data; to log file system data, enabling quick recovery after a crash; to simplify administration by dividing the file system into easily managed units called filesets; and to associate ACLs with files and directories.

Page 59

Security Service

Manages the security of the resources that make up a cell.

• Authentication service: reliable, mutual identification of communicating processes
• Privilege service: authorizes access to resources and securely conveys the information identifying a user in the distributed environment
• Registry service: manages and maintains a replicated database of the users, groups and service servers (principals) available in the cell
• ACL facility: controls access to resources by means of access control lists (ACLs)
• Login facility: authenticates users in the DCE environment
• Audit service: monitors and records operations related to all the security services

Page 60

Security Service

Page 61

Distributed File Service (DFS)

A distributed, replicated file system built on top of the DCE naming service and the security service.

• Hides the details of the physical location of files
• Makes DFS files and directories available on all nodes under the same name
• Provides replication, access control for objects (ACLs) and fast crash recovery

Components:
• Cache Manager: manages the client-side caching mechanism
• File Exporter: services DFS client requests to the exported file system
• Token Manager: manages shared (concurrent) access to the file system
• Replication Server: responsible for keeping replicas consistent between the DFS servers of a cell
• Update Server: allows easy distribution of new software versions within the cell
• Backup Server: provides automatic backup of filesets

Page 62

Finding a service in DCE

Page 63

DCE RPC

Page 64

The LSF queuing system on top of DCE

Advantages of using DCE to interconnect the computing systems by means of LSF:

• central user management (Registry Service)
• a global namespace
• a common file system (DFS)
  • improved efficiency (caching)
  • reliability (data replication)
• data integrity control and confidentiality
  • encrypted transmission
  • access control lists
• strengthened user authorization and authentication mechanisms
  • the Key Distribution Center (KDC) model
  • identification based on credentials