© 2009 IBM Corporation
Using the High Performance Power System Effectively
High Performance Power System
Date: 15/10/2009 | DongJoon Cho ([email protected]) | MTS, GTS, IBM Korea
Agenda
• Concerns about Power System
• Summary of the solutions
• Architectures for effective computing
  – H/W Architecture
  – System Architecture
  – S/W Architecture
Concerns about Power System
• Why can't we use a high-performance server at 100% after purchasing it?
• CPU clock speeds have risen, so why doesn't application performance follow?
• The clock is twice as fast, so why isn't performance doubled?
• We doubled the memory, so why doesn't utilization drop to half?
• Why does the IBM Power System achieve a higher tpmC than other systems?
• The IBM Power System has good response times, so why is its utilization high?

Will performance improve just by replacing the system, without any change to the software?
What do we know about a system besides its CPU clock speed?
Summary of the solutions
• Indirect methods
  – Firmware update
  – AIX update
  – Software update
• Direct methods
  – AIX configuration
  – Hardware planning/selection
  – System architecture
  – Software architecture

Most software problems are difficult to solve with the direct methods because of development time and cost.
Hardware Architecture - CPU
• CISC
  – Complex Instruction Set Computer architecture
  – Designed to provide every instruction the workload might need
  – VAX, x86
• EPIC
  – Explicitly Parallel Instruction Computing architecture
  – Jointly designed by HP and Intel; provides explicit instruction-level parallelism
  – IA64
• RISC
  – Reduced Instruction Set Computer architecture
  – Reduces the instruction set to the most frequently used instructions, shortening execution time for the majority of workloads
  – SPARC, POWER, PA-RISC
Hardware Architecture - CPU Instructions
• Computation Instructions

  Arithmetic operations          Logical operations
  ADD  Add                       AND    True if A and B true
  SUB  Subtract                  OR     True if A or B true
  MUL  Multiply                  NOT    True if A is false
  DIV  Divide                    XOR    True if only one of A and B is true
  INC  Increment                 SHL    Shift bits left
  DEC  Decrement                 SHR    Shift bits right
  CMP  Compare                   BSWAP  Reverse byte order

• Operand Types (the sequence computing C = A + B under each model)

  Stack     Accumulator   Register          Memory
  Push A    Ld A          Ld R1, A          Add C, B, A
  Push B    Add B         Ld R2, B
  Add       St C          Add R3, R2, R1
  Pop C                   St C, R3
Hardware Architecture - CPU Instructions
• Data Transfer Instructions

LD Load value from memory to a register
ST Store value from a register to memory
MOV Move value from register to register
CMOV Conditionally move value from register to register if a condition is met
PUSH Push value onto top of stack
POP Pop value from top of stack
Hardware Architecture - CPU Instructions
• Control Flow Instructions
• Control Flow Relative Frequency
JMP Unconditional jump to another instruction
BR Branch to instruction if condition is met
CALL Call a procedure
RET Return from procedure
INT Software interrupt
Instruction Integer programs Floating-point programs
Branch 75% 82%
Jump 6% 10%
Call & return 19% 8%
Hardware Architecture - CPU Instructions
• Common Instructions

  Instruction  Instruction type  Percent of instructions executed
  Load         Data transfer     22%
  Branch       Control flow      20%
  Compare      Computation       16%
  Store        Data transfer     12%
  Add          Computation        8%
  And          Computation        6%
  Sub          Computation        5%
  Move         Data transfer      4%
  Call         Control flow       1%
  Return       Control flow       1%
  Total                          95%

  Instruction type  Overall percentage
  Data transfer     38%
  Computation       35%
  Control flow      22%
Hardware Architecture - CPU and I/O
• CPU Speed versus I/O Speeds
• Several options to overcome I/O limitations
  – Incorporate more I/O buses (parallelism)
  – Extend current I/O technology (increase bandwidth, enhance operating modes)
  – Develop new I/O technology

I/O is slower than the CPU, so a range of techniques is needed to reduce I/O wait.
Hardware Architecture - CPU and I/O
• CPU Efficiency and CPU Access Costs
Performance degradation caused by I/O
Hardware Architecture - I/O
• The elements of an I/O system
Hardware Architecture - I/O : InfiniBand
• Comparing InfiniBand to Existing Technology– Differences and Benefits
  Change: From → To                                       Benefit
  Memory mapped → Channel based                           CPU efficiency, scalability, isolation, recovery
  Parallel bus → Switched fabric                          Scalability, isolation, redundancy, reduced pin-out, modularity, higher cross-sectional bandwidth
  Shared bus access → Point to point                      Greater distance, higher speeds
  Load/store → DMA scheduling                             Improved CPU efficiency
  Single open address space → Independent address domains Protection, isolation, recovery, reliability
Hardware Architecture - I/O : InfiniBand
Diagram: Shared Bus Topology / Shared Bus Architecture (traditional) versus Switched Fabric Topology / InfiniBand Switched Architecture (InfiniBand Architecture)
Hardware Architecture - I/O : InfiniBand
Accessing InfiniBand Services - The Channel Interface : Work / Completion Queue Architecture
Hardware Architecture - I/O : InfiniBand
InfiniBand Queue Operations – Operations on the send queue fall into three subclasses

Queues minimize waiting and enable asynchronous processing.
Hardware Architecture - I/O : InfiniBand
• VIA (Virtual Interface Architecture)
  – Message model
  – Direct, protected access by user-level software to the communications hardware; the protection is effected by means of the virtual memory system

Comparison of VIA and traditional communications:
  • Send and receive packet descriptors that specify scatter-gather operations—specifying where data must be distributed to and collected up from—when sending and receiving
  • A send message queue and a receive message queue, comprising linked lists of packet descriptors
  • A means of notifying the network interface that packets have been placed on a queue
  • An asynchronous notification process for the status of the operations requested (completion of a send or receive operation is signaled by writing state information into a packet descriptor)
  • Registration of memory areas used for communications: before communications are started, the memory areas for each hardware unit are identified and noted, allowing expensive operations, such as locking the pages, to be used and translating from virtual to real addresses to be done once, outside performance-critical data transfers
Hardware Architecture - I/O : InfiniBand
Logical processing steps in TCP/IP
White indicates per-message processing: it is the processing load imposed by the system call on the sockets interface, and is independent of the size of the message
Light gray indicates per-fragment processing (a long message is broken up into several fragments): this covers TCP, IP, media access and interrupt handling
Dark grey indicates per-byte processing (actually, per fragment plus per byte in fragment): this covers the data-copying overhead along with computation of the checksum
Checksum computation and memory management also add overhead.
Hardware Architecture - I/O : InfiniBand
Operation: Send
  Simple DMA:
    • set up the DMA registers (with buffer address and size)
    • lock the page containing the buffers and purge corresponding addresses in the data cache
    • activate the send command
    • wait until the end of the operation
    • interrupt upon completion of the operation, and free (unlock) the page
  Improved DMA:
    • refill the free buffers with data to be sent
    • lock the buffer page(s) and purge corresponding addresses in the data cache
    • refill a descriptor with the addresses and sizes of the buffers just set up
    • change the descriptor status indicator to "DMA"
    • if the DMA was inactive, wake it up

Operation: Receive
  Simple DMA:
    • DMA interrupts processor
    • allocate a page and purge the cache of its addresses
    • set up the DMA registers (with buffer address and size)
    • when the operation completes the DMA will raise an interrupt
  Improved DMA:
    • refill descriptor(s) for receiving
    • purge corresponding addresses in the data cache
    • when a receive operation completes, DMA sets the descriptor indicator to System; the OS can test the status of different descriptors
    • if there are no free buffers, the DMA raises an interrupt
• Mechanisms to reduce the number of interrupts

The improved DMA scheme lowers overhead by reducing the number of interrupts.
System Architecture (Hardware)
• LPAR / DLPAR
System Architecture (Hardware)
• LPAR / DLPAR
  – Hypervisor
System Architecture (Hardware)
• Micro Partitioning
  – Up to 10 partitions per processor
  – Resources shared among multiple partitions
System Architecture (Hardware)
• Micro Partitioning
System Architecture (Hardware)
• VIO
  – Part of the Advanced POWER Virtualization feature
  – Allows for sharing of physical devices, including storage and network
  – Implemented as a customized AIX-based appliance
  – Requires careful planning to maintain the VIO Server with minimal impact to VIO clients
  – Provides command line tools for maintenance, or can be maintained with NIM
System Architecture (System Software)
• SMT (Simultaneous Multi-Threading)
  – An enhanced hardware design in POWER5 that lets a processor execute two independent instruction streams simultaneously
  – Prioritizing hardware and software threads increases hardware utilization without hurting application performance
• WLM (Workload Manager)
  – Dynamically allocates system resources among running workloads without partitioning the system
  – Controls CPU resources at a finer granularity by dividing CPU time rather than whole processors
  – Manages applications with different characteristics on a single server by individually controlling CPU time, memory, and I/O
System Architecture (System Software)
• WPARs (Workload Partitions)– A workload partition (WPAR), new with the IBM® AIX® 6.1
operating system, expands on the traditional IBM AIX logical partitioning (LPAR) technology by further allowing AIX to be virtualized within a single operating-system image.
– A simple definition of a WPAR is that it is a virtualized AIX instance that runs within a single AIX operating-system image.
Software Architecture - OS
• The relationship between the OS and network programs
  – Components of a network program
    • Socket API
    • I/O
    • Processes or threads to handle multiple connections
    • IPC (Inter Process Communication) to synchronize the processes or threads

Diagram: the OS sits on the hardware (disk, NIC, ...) and exposes the file system, memory, Socket API, I/O, process/thread, and IPC services that network programs build on.
Software Architecture – File on the Unix
• What is a file on Unix?
• Checking the files a process has open
  – Files opened by default when a process is created
    • 0: standard input
    • 1: standard output
    • 2: standard error

office2@root/proc/9804/fd>ls -al
total 120
dr-x------  1 root system     0 Sep 27 03:22 .
dr-xr-xr-x  1 root system     0 Sep 27 03:22 ..
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 0 -> /
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 1 -> /
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 2 -> /
--w--w----  1 root system 12506 Sep 15 18:13 7
--w--w----  1 root system 12506 Sep 15 18:13 8
--w--w----  1 root system 12506 Sep 15 18:13 9
Software Architecture
• Application Programs and OS– Type of Software (Conceptual Model)
Software Architecture
• Application Programs and OS– Application Programs
Software Architecture
• Application Programs and OS– Operating Systems
Software Architecture
• Application Programs and OS– Device Drivers
Software Architecture
• Application Programs and OS– AIX 5L Structure
Application Architecture
• Multi-Process Model
  – Process: the flow of control representing a running program, together with the system resources (memory, files, IPC, ...) it owns
  – Process creation and control
    • fork()
      – Creates a copy of the calling process
      – The child process shares the parent's code
    • exec()
      – Replaces the current process image with a new program
      – Loads and runs a new program

Diagram: the init process fork()s children (Process A, Process B), each of which exec()s a new program image.

IPC between multiple processes increases kernel overhead.
Application Architecture
• Multi-Processing Model– Socket Program
Server flow: socket → bind → listen → accept → fork() → read → write
Client flow: socket → connect → write → read → close
(The client issues a connection request and then a data request; the server forks per connection and the client receives the data.)
Application Architecture
• Multi-Processing Model

1. fork() per request: ① a client connects to the server ② the server calls fork()
   A fork() happens for every incoming request.
2. Process pool: ① fork() in advance ② a client connects to the server
   Because fork() takes a long time, child processes are fork()ed into a pool beforehand.
Application Architecture
• IPC (Inter Process Communication)
  – What is IPC?
    • A mechanism processes use to share data and synchronize with one another
  – Kinds of IPC
    • Semaphore
      – Synchronizes and protects data shared between processes
    • Shared Memory
      – Multiple processes share virtual memory; the fastest way to share memory
    • Message Queues
      – A queue is a first-in, first-out data structure
      – As an IPC mechanism, a message queue is more intuitive and simpler to use than the other sharing methods, but considerably harder to control
Application Architecture
• IPC
  – Kinds of IPC
    • Pipe
      – Used to pass one process's data to another. Data flows in one direction only (each end is read-only or write-only, never both), and a pipe can only be used between processes that share a common parent (same PPID)
    • FIFO (Named Pipe)
      – A first-in, first-out I/O stream similar to a pipe, but given a name so that unrelated processes can use it
      – Created with mknod
    • UDS (Unix Domain Socket)
      – Usable through the socket API without modification; unlike the port-based Internet domain socket, it uses the local file system for communication between local processes
Application Architecture
• IPC
  – IPC commands
    – ipcs command
      • ipcs -m (shared memory)
      • ipcs -q (message queues)
      • ipcs -s (semaphores)
    – ipcrm command
      • Removes semaphores, message queues, and shared memory segments from the system

  Function                      Message queue  Semaphore  Shared memory
  1. Allocate IPC               msgget         semget     shmget
  2. Control IPC                msgctl         semctl     shmctl
     (change state, release)
  3. Operate IPC (send/receive) msgsnd, msgrcv semop      shmat, shmdt
Application Architecture
• IPC– IPC Limits
  Semaphores                                        4.3.0  4.3.1  4.3.2   5.1     5.2     5.3
  Maximum number of semaphore IDs (32-bit kernel)    4096   4096  131072  131072  131072  131072
  Maximum number of semaphore IDs (64-bit kernel)    4096   4096  131072  131072  131072  1048576
  Maximum semaphores per semaphore ID               65535  65535  65535   65535   65535   65535
  Maximum operations per semop call                  1024   1024   1024    1024    1024    1024
  Maximum undo entries per process                   1024   1024   1024    1024    1024    1024
  Size in bytes of undo structure                    8208   8208   8208    8208    8208    8208
  Semaphore maximum value                           32767  32767  32767   32767   32767   32767
  Adjust on exit maximum value                      16384  16384  16384   16384   16384   16384
Application Architecture
• IPC– IPC Limits
  Message Queue                                          4.3.0   4.3.1   4.3.2   5.1     5.2     5.3
  Maximum message size                                   4 MB    4 MB    4 MB    4 MB    4 MB    4 MB
  Maximum bytes on queue                                 4 MB    4 MB    4 MB    4 MB    4 MB    4 MB
  Maximum number of message queue IDs (32-bit kernel)    4096    4096    131072  131072  131072  131072
  Maximum number of message queue IDs (64-bit kernel)    4096    4096    131072  131072  131072  1048576
  Maximum messages per queue ID                          524288  524288  524288  524288  524288  524288
Application Architecture
• IPC– IPC Limits
  Shared Memory                                            4.3.0   4.3.1  4.3.2   5.1     5.2     5.3
  Maximum segment size (32-bit process)                    256 MB  2 GB   2 GB    2 GB    2 GB    2 GB
  Maximum segment size (64-bit process, 32-bit kernel)     256 MB  2 GB   2 GB    64 GB   1 TB    1 TB
  Maximum segment size (64-bit process, 64-bit kernel)     256 MB  2 GB   2 GB    64 GB   1 TB    32 TB
  Minimum segment size                                     1       1      1       1       1       1
  Maximum number of shared memory IDs (32-bit kernel)      4096    4096   131072  131072  131072  131072
  Maximum number of shared memory IDs (64-bit kernel)      4096    4096   131072  131072  131072  1048576
  Maximum number of segments per process (32-bit process)  11      11     11      11      11      11
  Maximum number of segments per process (64-bit process)  268435456 in all releases
Application Architecture
• IPC
  – IPC tunable parameters
    – msgmax
      Purpose: Specifies maximum message size.
      Values: Dynamic with maximum value of 4 MB
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – msgmnb
      Purpose: Specifies maximum number of bytes on queue.
      Values: Dynamic with maximum value of 4 MB
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – msgmni
      Purpose: Specifies maximum number of message queue IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – msgmnm
      Purpose: Specifies maximum number of messages per queue.
      Values: Dynamic with maximum value of 524288
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semaem
      Purpose: Specifies maximum value for adjustment on exit.
      Values: Dynamic with maximum value of 16384
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semmni
      Purpose: Specifies maximum number of semaphore IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semmsl
      Purpose: Specifies maximum number of semaphores per ID.
      Values: Dynamic with maximum value of 65535
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semopm
      Purpose: Specifies maximum number of operations per semop() call.
      Values: Dynamic with maximum value of 1024
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semume
      Purpose: Specifies maximum number of undo entries per process.
      Values: Dynamic with maximum value of 1024
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semvmx
      Purpose: Specifies maximum value of a semaphore.
      Values: Dynamic with maximum value of 32767
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – shmmax
      Purpose: Specifies maximum shared memory segment size.
      Values: Dynamic with maximum value of 256 MB for 32-bit processes and 0x80000000u for 64-bit
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – shmmin
      Purpose: Specifies minimum shared-memory-segment size.
      Values: Dynamic with minimum value of 1
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – shmmni
      Purpose: Specifies maximum number of shared memory IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• Multi-Thread Model
  – Thread: a flow of control that exists within a process
  – Socket Program

Server flow: socket → bind → listen → accept → pthread_create() → read → write
Client flow: socket → connect → write → read → close
(The client issues a connection request and then a data request; the server creates a thread per connection and the client receives the data.)
Application Architecture
• Multi-Thread Model

1. Thread per request: ① a client connects to the server ② the server calls pthread_create()
   A pthread_create() happens for every request, but it is far lighter than fork().
2. Thread pool: ① pthread_create() in advance ② a client connects to the server
   Although threads are lighter than fork(), a pool is used to eliminate even the thread-creation time.
Application Architecture
• N:N DB Connection (Multi-Process Model)

Diagram: application child processes (created by fork()) each hold their own connection to Oracle, which fork()s a server process per connection (connection n:n).

DB connections are made n:n, and Oracle's per-connection fork() wastes system resources.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• 1:1 DB Connection (Multi-Process Model)

Diagram: the application's child processes share a single connection to one Oracle child process (connection 1:1).

With a 1:1 DB connection Oracle fork()s only once, so little system resource is wasted, but client connections may not be served smoothly.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• DB Connection Pool (Multi-Process Model)
  – Thread Pool or Process Pool

Diagram: application threads borrow connections that were established in advance between the pool and Oracle's child processes (connection pool, connection n:n).

Requests are handled over connections pre-established in the pool; the pool lends out its resources and can allocate them flexibly when they run short.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• DB Connection Pool (Multi-Thread Model)
  – Multi-Threading Model (①, ⑤)
  – Thread Pool Model for DB Connection (①, ②, ③, ⑥)

Diagram: clients reach the server application's threads (①), which either connect directly (⑤) or go through a pre-thread pool (thread pool) and pre-process pool (process pool) (②, ③, ④, ⑥) to Oracle's child processes.
Application Architecture
• I/O Multiplexing Model
  – Instead of each socket communicating through its own socket I/O, a single socket I/O path is shared: sockets are registered in the file descriptor table, and the table's I/O is monitored to handle multiple connections
  – select / poll

Flow: on a connection request, the server assigns a file descriptor; while data is sent and received, it watches the file descriptors; on disconnect, it releases the file descriptor.
Application Architecture
• I/O Multiplexing Model
  – Drawback
    • select / poll is used for I/O multiplexing, but to find which file descriptor raised an event, the whole wide file descriptor array must be checked one by one in a loop

Diagram: every entry of the file descriptor table has to be examined, even though only the registered descriptors matter.
Application Architecture
• Event-based I/O Model through Real-time Signals
  – Event-based socket handling
    • UNIX/Linux: POSIX Real-time Signals, epoll
    • Windows, AIX, iSeries OS: IOCP
    • FreeBSD: kqueue (kernel queue)
Application Architecture
• Event-based I/O Model through Real-time Signals
  – Real-time Signals
    • Address two weaknesses of classic signals: they are not queued, and as a result they carry no information
    • Real-time signals are queued, and events can be stored up to the queue's capacity, so signal loss is avoided
    • A real-time signal can also deliver information such as the descriptor of the socket that raised it, so extra context can be attached
    • There is no need to scan the descriptor array of the file descriptor table as with select / poll

Diagram: sockets 1–3 deliver SIGRTMIN+1, SIGRTMIN+2, and SIGRTMIN+3 to threads 1–3; real-time signals are used together with a thread pool.
Application Architecture

• epoll
  – epoll: event poll
  – Roughly 10%–20% better performance than Real-time Signals
  – Supported on HP-UX and Red Hat; not supported on AIX
  – Because descriptors are managed inside the event poll, a read/write event returns the related information (such as the descriptor) directly, so there is no need to check in a loop as with poll

Diagram: sockets 1–3 register their file descriptors with the event poll.
Application Architecture
• epoll
  – httpd test results (charts): dphttpd symmetric multiprocessor result, dphttpd uniprocessor result
Application Architecture
• epoll
  – Pipetest results (charts): Pipetest symmetric multiprocessor result, Pipetest uniprocessor result
Application Architecture
• epoll
  – Dead connection test results (charts): 128-byte context and 1024-byte context dead-connection test results
Application Architecture
• IOCP (I/O Completion Ports)
  – IOCP on iSeries
    • Supported since the AS/400; the i line began in 1988 as the AS/400 and evolved through AS/400, OS/400, i5/OS, and i6/OS
    • AS/400 QMU 5.0.1.02 introduces asynchronous I/O completion ports (IOCP)
  – IOCP on Windows NT
    • Supported since Winsock2 on Windows NT
  – IOCP on AIX
    • I/O completion port support was first introduced in AIX 4.3 by APAR IY06351. An I/O completion port was originally a Windows NT scheduling construct that has since been implemented in other OSes. Domino uses these constructs to improve the scalability of the server. It allows one thread to handle multiple session requests, so that a Notes client session is no longer bound to a single thread for its duration. The completion port is tied directly to a device handle and any network I/O requests that are made to that handle.
Application Architecture
• Parallel Programming
  – Fundamentals of Parallel Programming
    • Multi-Process / Multi-Thread
    • Asynchronous Procedure Calls
    • Signal, Event
    • Queuing Asynchronous Procedure Calls
    • IOCP

Ex) File Finder Agent
Application Architecture
• Parallel Programming – OpenMP (Open Multi-Processing)
  – An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
  – Comprised of three primary API components
    • Compiler Directives
    • Runtime Library Routines
    • Environment Variables
  – Portable
  – Standardized
Application Architecture
• Parallel Programming - MPI
  – MPI (Message Passing Interface)
    • A standard data communication library for message-passing parallel programming
    • References
      – http://www.mcs.anl.gov/mpi/index.html
      – http://www.mpi-forum.org/docs/docs.html
  – MPI goals
    • Portability
    • Efficiency
    • Functionality
Application Architecture
• Parallel Programming
  – MPI basic concepts
    • Work is assigned per process
    • Processor : Process = 1:1 or 1:N
    • Message = data + envelope
      – Which process is sending?
      – Where is the data that is being sent?
      – What data is being sent?
      – How much is being sent?
      – Which process receives it?
      – Where will it be stored?
      – How much should the receiver be prepared to accept?
    • Tag
      – Used for message matching and classification
      – Allows messages to be processed in arrival order
      – Wildcards can be used
    • Communicator
      – The set of processes allowed to communicate with one another
Application Architecture
• Parallel Programming
  – MPI basic concepts
    • Process Rank
      – An identifier distinguishing the processes within the same communicator
    • Point-to-Point Communication
      – Communication between two processes
      – One sending process paired with one receiving process
    • Collective Communication
      – Several processes participate at once
      – 1:N, N:1, and N:N patterns are possible
      – Replaces multiple point-to-point communications with a single collective communication
        » Less error-prone, and faster thanks to optimization
Application Architecture
• Java– Development and execution of Java applications
Application Architecture
• Java
  – How to use the system efficiently from Java applications
    • NIO (New I/O)
    • NIO pollset
    • Let the garbage collector collect automatically
    • Unless there is a specific reason not to, keep the JRE updated to the latest release
    • Keep source code current during development (replace APIs marked Deprecated with alternatives where possible)
    • If you use a framework, keep the framework up to date

Some of these always bring improvement; others may not, depending on the JRE or framework.
Application Architecture
• Java
  – pollset
    • Java source code (selector-based NIO, and the native poll() call it maps to)

      DatagramChannel channel = DatagramChannel.open();
      channel.configureBlocking(false);
      Selector selector = Selector.open();
      channel.register(selector, SelectionKey.OP_READ);

      int poll(struct pollfd fds[], nfds_t nfds, int timeout);

    • Native pollset interface C source code

      pollset_t ps = pollset_create(int maxfd);
      int rc = pollset_destroy(pollset_t ps);
      int rc = pollset_ctl(pollset_t ps, struct poll_ctl *pollctl_array, int array_length);
      int nfound = pollset_poll(pollset_t ps, struct pollfd *polldata_array, int array_length, int timeout);
Application Architecture
• Java– pollset
• Traditional poll method
Application Architecture
• Java– pollset
• pollset method
Application Architecture
• Java– pollset
• pollcache internal– pollcache control block
Application Architecture
• Java– pollset
• pollset() – bulky update
Application Architecture
• Java– pollset
• Throughput performance of the two drivers (with poll() and with pollset())
  – The pollset driver performs 13.3% better than the poll driver
Application Architecture
• Java– pollset
• Time spent on CPU
AIX I/O Model
• select / poll
• pollset
• event
• Real-time Signal
• AIO
• IOCP
AIX IOCP
• IOCP
– I/O completion port support was first introduced in AIX 4.3 by APAR IY06351. An I/O completion port was originally a Windows NT scheduling construct that has since been implemented in other OS's.
AIX IOCP
• IOCP– Synchronous I/O versus asynchronous I/O
AIX IOCP
• IOCP– IOCP Operation
AIX IOCP
• IOCP– CreateIoCompletionPort Function
< IOCP on AIX >
#include <iocp.h>
int CreateIoCompletionPort (FileDescriptor, CompletionPort, CompletionKey, ConcurrentThreads)
HANDLE FileDescriptor, CompletionPort;
DWORD CompletionKey, ConcurrentThreads;
< IOCP on Windows >
HANDLE CreateIoCompletionPort (
    HANDLE FileHandle,                 // handle to file (socket)
    HANDLE ExistingCompletionPort,     // handle to I/O completion port
    ULONG_PTR CompletionKey,           // completion key
    DWORD NumberOfConcurrentThreads    // number of threads to execute concurrently
);
AIX IOCP
• IOCP– How to configure IOCP on AIX
• fileset : bos.iocp.rte
$ lslpp -l bos.iocp.rte
The output from the lslpp command should be similar to the following:
  Fileset                      Level  State      Description
  ----------------------------------------------------------------------------
  Path: /usr/lib/objrepos
  bos.iocp.rte               5.3.9.0  APPLIED    I/O Completion Ports API
  Path: /etc/objrepos
  bos.iocp.rte              5.3.0.50  COMMITTED  I/O Completion Ports API

office2@root/>lsdev -Cc iocp
iocp0 Available I/O Completion Ports

office2@root/>lsattr -El iocp0
autoconfig available STATE to be configured at system restart True
AIX IOCP
• IOCP API
  – CreateCompletionPort
  – GetMultipleCompletionStatus
  – GetQueuedCompletionStatus
  – PostQueuedCompletionStatus
  – ReadFile
  – WriteFile
iSeries IOCP
• IOCP API
  – QsoStartAccept
  – QsoCreateIOCompletionPort
  – QsoDestroyIOCompletionPort
  – QsoPostIOCompletion
  – QsoStartRecv
  – QsoStartSend
  – QsoCancelOperation
  – QsoWaitForIOCompletion
Windows IOCP
• IOCP API
  – CreateIoCompletionPort
  – GetQueuedCompletionStatus
  – GetQueuedCompletionStatusEx
  – PostQueuedCompletionStatus
  – ReadFileEx
  – WriteFileEx
  – Kernel Functions
    • NtCreateIoCompletion, NtRemoveIoCompletion
    • KeInitializeQueue, KeRemoveQueue
    • KeInsertQueue
    • KeWaitForSingleObject
    • KeDelayExecutionThread
    • KiActivateWaiterQueue
    • KiUnwaitThread
    • NtSetIoCompletion
Q & A