© 2009 IBM Corporation
Using the High Performance Power System Effectively
High Performance Power System
Date: 15/10/2009 | DongJoon Cho ([email protected]) | MTS, GTS, IBM Korea
Agenda
• Concerns about Power System
• Summary of the solutions
• Architectures for effective computing
  – H/W Architecture
  – System Architecture
  – S/W Architecture
Concerns about Power System
• Why can't we use a high-performance server at 100% after purchasing it?
• CPU clock speeds have risen, so why doesn't application performance follow?
• The clock is twice as fast, so why isn't performance doubled?
• We doubled the memory, so why doesn't utilization drop to half?
• Why does the IBM Power System achieve a higher tpmC than other systems?
• The IBM Power System has good response times, so why is its utilization high?

Will performance improve just by replacing the system, without any change to the software?
What do we know about a system besides its CPU clock speed?
Summary of the solutions
• Indirect methods
  – Firmware update
  – AIX update
  – Software update
• Direct methods
  – AIX configuration
  – Hardware planning/selection
  – System architecture
  – Software architecture

Most software problems are difficult to solve with the direct methods because of development time and cost.
Hardware Architecture - CPU
• CISC
  – Complex Instruction Set Computer architecture
  – Designed to provide every instruction the workload might need
  – VAX, x86
• EPIC
  – Explicitly Parallel Instruction Computing architecture
  – Jointly designed by HP and Intel; provides explicit instruction-level parallelism
  – IA64
• RISC
  – Reduced Instruction Set Computer architecture
  – Reduces the instruction set to the most frequently used instructions, shortening execution time for the majority of workloads
  – SPARC, POWER, PA-RISC
Hardware Architecture - CPU Instructions
• Computation Instructions

  Arithmetic operations          Logical operations
  ADD  Add                       AND    True if A and B true
  SUB  Subtract                  OR     True if A or B true
  MUL  Multiply                  NOT    True if A is false
  DIV  Divide                    XOR    True if only one of A and B is true
  INC  Increment                 SHL    Shift bits left
  DEC  Decrement                 SHR    Shift bits right
  CMP  Compare                   BSWAP  Reverse byte order

• Operand Types (the sequence computing C = A + B under each model)

  Stack     Accumulator   Register          Memory
  Push A    Ld A          Ld R1, A          Add C, B, A
  Push B    Add B         Ld R2, B
  Add       St C          Add R3, R2, R1
  Pop C                   St C, R3
Hardware Architecture - CPU Instructions
• Data Transfer Instructions

LD Load value from memory to a register
ST Store value from a register to memory
MOV Move value from register to register
CMOV Conditionally move value from register to register if a condition is met
PUSH Push value onto top of stack
POP Pop value from top of stack
Hardware Architecture - CPU Instructions
• Control Flow Instructions
• Control Flow Relative Frequency
JMP Unconditional jump to another instruction
BR Branch to instruction if condition is met
CALL Call a procedure
RET Return from procedure
INT Software interrupt
Instruction Integer programs Floating-point programs
Branch 75% 82%
Jump 6% 10%
Call & return 19% 8%
Hardware Architecture - CPU Instructions
• Common Instructions

  Instruction  Instruction type  Percent of instructions executed
  Load         Data transfer     22%
  Branch       Control flow      20%
  Compare      Computation       16%
  Store        Data transfer     12%
  Add          Computation        8%
  And          Computation        6%
  Sub          Computation        5%
  Move         Data transfer      4%
  Call         Control flow       1%
  Return       Control flow       1%
  Total                          95%

  Instruction type  Overall percentage
  Data transfer     38%
  Computation       35%
  Control flow      22%
Hardware Architecture - CPU and I/O
• CPU Speed versus I/O Speeds
• Several options to overcome I/O limitations
  – Incorporate more I/O buses (parallelism)
  – Extend current I/O technology (increase bandwidth, enhance operating modes)
  – Develop new I/O technology

I/O is slower than the CPU, so a range of techniques is needed to reduce I/O wait.
Hardware Architecture - CPU and I/O
• CPU Efficiency and CPU Access Costs
Performance degradation caused by I/O
Hardware Architecture - I/O
• The elements of an I/O system
Hardware Architecture - I/O : InfiniBand
• Comparing InfiniBand to Existing Technology– Differences and Benefits
  Change: From → To                                       Benefit
  Memory mapped → Channel based                           CPU efficiency, scalability, isolation, recovery
  Parallel bus → Switched fabric                          Scalability, isolation, redundancy, reduced pin-out, modularity, higher cross-sectional bandwidth
  Shared bus access → Point to point                      Greater distance, higher speeds
  Load/store → DMA scheduling                             Improved CPU efficiency
  Single open address space → Independent address domains Protection, isolation, recovery, reliability
Hardware Architecture - I/O : InfiniBand
Diagram: Shared Bus Topology / Shared Bus Architecture (traditional) versus Switched Fabric Topology / InfiniBand Switched Architecture (InfiniBand Architecture)
Hardware Architecture - I/O : InfiniBand
Accessing InfiniBand Services - The Channel Interface : Work / Completion Queue Architecture
Hardware Architecture - I/O : InfiniBand
InfiniBand Queue Operations – Operations on the send queue fall into three subclasses

Queues minimize waiting and enable asynchronous processing.
Hardware Architecture - I/O : InfiniBand
• VIA (Virtual Interface Architecture)
  – Message model
  – Direct, protected access by user-level software to the communications hardware; the protection is effected by means of the virtual memory system

Comparison of VIA and traditional communications:
  • Send and receive packet descriptors that specify scatter-gather operations—specifying where data must be distributed to and collected up from—when sending and receiving
  • A send message queue and a receive message queue, comprising linked lists of packet descriptors
  • A means of notifying the network interface that packets have been placed on a queue
  • An asynchronous notification process for the status of the operations requested (completion of a send or receive operation is signaled by writing state information into a packet descriptor)
  • Registration of memory areas used for communications: before communications are started, the memory areas for each hardware unit are identified and noted, allowing expensive operations, such as locking the pages, to be used and translating from virtual to real addresses to be done once, outside performance-critical data transfers
Hardware Architecture - I/O : InfiniBand
Logical processing steps in TCP/IP
White indicates per-message processing: it is the processing load imposed by the system call on the sockets interface, and is independent of the size of the message
Light gray indicates per-fragment processing (a long message is broken up into several fragments): this covers TCP, IP, media access and interrupt handling
Dark grey indicates per-byte processing (actually, per fragment plus per byte in fragment): this covers the data-copying overhead along with computation of the checksum
Checksum computation and memory management also add overhead.
Hardware Architecture - I/O : InfiniBand
Operation: Send
  Simple DMA:
    • set up the DMA registers (with buffer address and size)
    • lock the page containing the buffers and purge corresponding addresses in the data cache
    • activate the send command
    • wait until the end of the operation
    • interrupt upon completion of the operation, and free (unlock) the page
  Improved DMA:
    • refill the free buffers with data to be sent
    • lock the buffer page(s) and purge corresponding addresses in the data cache
    • refill a descriptor with the addresses and sizes of the buffers just set up
    • change the descriptor status indicator to "DMA"
    • if the DMA was inactive, wake it up

Operation: Receive
  Simple DMA:
    • DMA interrupts processor
    • allocate a page and purge the cache of its addresses
    • set up the DMA registers (with buffer address and size)
    • when the operation completes the DMA will raise an interrupt
  Improved DMA:
    • refill descriptor(s) for receiving
    • purge corresponding addresses in the data cache
    • when a receive operation completes, DMA sets the descriptor indicator to System; the OS can test the status of different descriptors
    • if there are no free buffers, the DMA raises an interrupt
• Mechanisms to reduce the number of interrupts

The improved DMA scheme lowers overhead by reducing the number of interrupts.
System Architecture (Hardware)
• LPAR / DLPAR
System Architecture (Hardware)
• LPAR / DLPAR
  – Hypervisor
System Architecture (Hardware)
• Micro Partitioning
  – Up to 10 partitions per processor
  – Resources shared among multiple partitions
System Architecture (Hardware)
• Micro Partitioning
System Architecture (Hardware)
• VIO
  – Part of the Advanced POWER Virtualization feature
  – Allows for sharing of physical devices, including storage and network
  – Implemented as a customized AIX-based appliance
  – Requires careful planning to maintain the VIO Server with minimal impact to VIO clients
  – Provides command line tools for maintenance, or can be maintained with NIM
System Architecture (System Software)
• SMT (Simultaneous Multi-Threading)
  – An enhanced hardware design in POWER5 that lets a processor execute two independent instruction streams simultaneously
  – Prioritizing hardware and software threads increases hardware utilization without hurting application performance
• WLM (Workload Manager)
  – Dynamically allocates system resources among running workloads without partitioning the system
  – Controls CPU resources at a finer granularity by dividing CPU time rather than whole processors
  – Manages applications with different characteristics on a single server by individually controlling CPU time, memory, and I/O
System Architecture (System Software)
• WPARs (Workload Partitions)– A workload partition (WPAR), new with the IBM® AIX® 6.1
operating system, expands on the traditional IBM AIX logical partitioning (LPAR) technology by further allowing AIX to be virtualized within a single operating-system image.
– A simple definition of a WPAR is that it is a virtualized AIX instance that runs within a single AIX operating-system image.
Software Architecture - OS
• The relationship between the OS and network programs
  – Components of a network program
    • Socket API
    • I/O
    • Processes or threads to handle multiple connections
    • IPC (Inter Process Communication) to synchronize the processes or threads

Diagram: the OS sits on the hardware (disk, NIC, ...) and exposes the file system, memory, Socket API, I/O, process/thread, and IPC services that network programs build on.
Software Architecture – File on the Unix
• What is a file on Unix?
• Checking the files a process has open
  – Files opened by default when a process is created
    • 0: standard input
    • 1: standard output
    • 2: standard error

office2@root/proc/9804/fd>ls -al
total 120
dr-x------  1 root system     0 Sep 27 03:22 .
dr-xr-xr-x  1 root system     0 Sep 27 03:22 ..
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 0 -> /
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 1 -> /
lr-xr-xr-x 24 root system  1024 Sep 22 18:48 2 -> /
--w--w----  1 root system 12506 Sep 15 18:13 7
--w--w----  1 root system 12506 Sep 15 18:13 8
--w--w----  1 root system 12506 Sep 15 18:13 9
Software Architecture
• Application Programs and OS– Type of Software (Conceptual Model)
Software Architecture
• Application Programs and OS– Application Programs
Software Architecture
• Application Programs and OS– Operating Systems
Software Architecture
• Application Programs and OS– Device Drivers
Software Architecture
• Application Programs and OS– AIX 5L Structure
Application Architecture
• Multi-Process Model
  – Process: the flow of control representing a running program, together with the system resources (memory, files, IPC, ...) it owns
  – Process creation and control
    • fork()
      – Creates a copy of the calling process
      – The child process shares the parent's code
    • exec()
      – Replaces the current process image with a new program
      – Loads and runs a new program

Diagram: the init process fork()s children (Process A, Process B), each of which exec()s a new program image.

IPC between multiple processes increases kernel overhead.
Application Architecture
• Multi-Processing Model– Socket Program
Server flow: socket → bind → listen → accept → fork() → read → write
Client flow: socket → connect → write → read → close
(The client issues a connection request and then a data request; the server forks per connection and the client receives the data.)
Application Architecture
• Multi-Processing Model

1. fork() per request: ① a client connects to the server ② the server calls fork()
   A fork() happens for every incoming request.
2. Process pool: ① fork() in advance ② a client connects to the server
   Because fork() takes a long time, child processes are fork()ed into a pool beforehand.
Application Architecture
• IPC (Inter Process Communication)
  – What is IPC?
    • A mechanism processes use to share data and synchronize with one another
  – Kinds of IPC
    • Semaphore
      – Synchronizes and protects data shared between processes
    • Shared Memory
      – Multiple processes share virtual memory; the fastest way to share memory
    • Message Queues
      – A queue is a first-in, first-out data structure
      – As an IPC mechanism, a message queue is more intuitive and simpler to use than the other sharing methods, but considerably harder to control
Application Architecture
• IPC
  – Kinds of IPC
    • Pipe
      – Used to pass one process's data to another. Data flows in one direction only (each end is read-only or write-only, never both), and a pipe can only be used between processes that share a common parent (same PPID)
    • FIFO (Named Pipe)
      – A first-in, first-out I/O stream similar to a pipe, but given a name so that unrelated processes can use it
      – Created with mknod
    • UDS (Unix Domain Socket)
      – Usable through the socket API without modification; unlike the port-based Internet domain socket, it uses the local file system for communication between local processes
Application Architecture
• IPC
  – IPC commands
    – ipcs command
      • ipcs -m (shared memory)
      • ipcs -q (message queues)
      • ipcs -s (semaphores)
    – ipcrm command
      • Removes semaphores, message queues, and shared memory segments from the system

  Function                      Message queue  Semaphore  Shared memory
  1. Allocate IPC               msgget         semget     shmget
  2. Control IPC                msgctl         semctl     shmctl
     (change state, release)
  3. Operate IPC (send/receive) msgsnd, msgrcv semop      shmat, shmdt
Application Architecture
• IPC– IPC Limits
  Semaphores                                        4.3.0  4.3.1  4.3.2   5.1     5.2     5.3
  Maximum number of semaphore IDs (32-bit kernel)    4096   4096  131072  131072  131072  131072
  Maximum number of semaphore IDs (64-bit kernel)    4096   4096  131072  131072  131072  1048576
  Maximum semaphores per semaphore ID               65535  65535  65535   65535   65535   65535
  Maximum operations per semop call                  1024   1024   1024    1024    1024    1024
  Maximum undo entries per process                   1024   1024   1024    1024    1024    1024
  Size in bytes of undo structure                    8208   8208   8208    8208    8208    8208
  Semaphore maximum value                           32767  32767  32767   32767   32767   32767
  Adjust on exit maximum value                      16384  16384  16384   16384   16384   16384
Application Architecture
• IPC– IPC Limits
  Message Queue                                          4.3.0   4.3.1   4.3.2   5.1     5.2     5.3
  Maximum message size                                   4 MB    4 MB    4 MB    4 MB    4 MB    4 MB
  Maximum bytes on queue                                 4 MB    4 MB    4 MB    4 MB    4 MB    4 MB
  Maximum number of message queue IDs (32-bit kernel)    4096    4096    131072  131072  131072  131072
  Maximum number of message queue IDs (64-bit kernel)    4096    4096    131072  131072  131072  1048576
  Maximum messages per queue ID                          524288  524288  524288  524288  524288  524288
Application Architecture
• IPC– IPC Limits
  Shared Memory                                            4.3.0   4.3.1  4.3.2   5.1     5.2     5.3
  Maximum segment size (32-bit process)                    256 MB  2 GB   2 GB    2 GB    2 GB    2 GB
  Maximum segment size (64-bit process, 32-bit kernel)     256 MB  2 GB   2 GB    64 GB   1 TB    1 TB
  Maximum segment size (64-bit process, 64-bit kernel)     256 MB  2 GB   2 GB    64 GB   1 TB    32 TB
  Minimum segment size                                     1       1      1       1       1       1
  Maximum number of shared memory IDs (32-bit kernel)      4096    4096   131072  131072  131072  131072
  Maximum number of shared memory IDs (64-bit kernel)      4096    4096   131072  131072  131072  1048576
  Maximum number of segments per process (32-bit process)  11      11     11      11      11      11
  Maximum number of segments per process (64-bit process)  268435456 in all releases
Application Architecture
• IPC
  – IPC tunable parameters
    – msgmax
      Purpose: Specifies maximum message size.
      Values: Dynamic with maximum value of 4 MB
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – msgmnb
      Purpose: Specifies maximum number of bytes on queue.
      Values: Dynamic with maximum value of 4 MB
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – msgmni
      Purpose: Specifies maximum number of message queue IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – msgmnm
      Purpose: Specifies maximum number of messages per queue.
      Values: Dynamic with maximum value of 524288
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semaem
      Purpose: Specifies maximum value for adjustment on exit.
      Values: Dynamic with maximum value of 16384
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semmni
      Purpose: Specifies maximum number of semaphore IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semmsl
      Purpose: Specifies maximum number of semaphores per ID.
      Values: Dynamic with maximum value of 65535
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semopm
      Purpose: Specifies maximum number of operations per semop() call.
      Values: Dynamic with maximum value of 1024
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – semume
      Purpose: Specifies maximum number of undo entries per process.
      Values: Dynamic with maximum value of 1024
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – semvmx
      Purpose: Specifies maximum value of a semaphore.
      Values: Dynamic with maximum value of 32767
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – shmmax
      Purpose: Specifies maximum shared memory segment size.
      Values: Dynamic with maximum value of 256 MB for 32-bit processes and 0x80000000u for 64-bit
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
    – shmmin
      Purpose: Specifies minimum shared-memory-segment size.
      Values: Dynamic with minimum value of 1
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• IPC
  – IPC tunable parameters
    – shmmni
      Purpose: Specifies maximum number of shared memory IDs.
      Values: Dynamic with maximum value of 131072
      Display / Change / Diagnosis: N/A
      Tuning: Does not require tuning because it is dynamically adjusted as needed by the kernel.
Application Architecture
• Multi-Thread Model
  – Thread: a flow of control that exists within a process
  – Socket Program

Server flow: socket → bind → listen → accept → pthread_create() → read → write
Client flow: socket → connect → write → read → close
(The client issues a connection request and then a data request; the server creates a thread per connection and the client receives the data.)
Application Architecture
• Multi-Thread Model

1. Thread per request: ① a client connects to the server ② the server calls pthread_create()
   A pthread_create() happens for every request, but it is far lighter than fork().
2. Thread pool: ① pthread_create() in advance ② a client connects to the server
   Although threads are lighter than fork(), a pool is used to eliminate even the thread-creation time.
Application Architecture
• N:N DB Connection (Multi-Process Model)

Diagram: application child processes (created by fork()) each hold their own connection to Oracle, which fork()s a server process per connection (connection n:n).

DB connections are made n:n, and Oracle's per-connection fork() wastes system resources.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• 1:1 DB Connection (Multi-Process Model)

Diagram: the application's child processes share a single connection to one Oracle child process (connection 1:1).

With a 1:1 DB connection Oracle fork()s only once, so little system resource is wasted, but client connections may not be served smoothly.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• DB Connection Pool (Multi-Process Model)
  – Thread Pool or Process Pool

Diagram: application threads borrow connections that were established in advance between the pool and Oracle's child processes (connection pool, connection n:n).

Requests are handled over connections pre-established in the pool; the pool lends out its resources and can allocate them flexibly when they run short.

The heaviest loads in a DB query:
1. DB connect (from the network)
2. Parsing the DB query
Application Architecture
• DB Connection Pool (Multi-Thread Model)
  – Multi-Threading Model (①, ⑤)
  – Thread Pool Model for DB Connection (①, ②, ③, ⑥)

Diagram: clients reach the server application's threads (①), which either connect directly (⑤) or go through a pre-thread pool (thread pool) and pre-process pool (process pool) (②, ③, ④, ⑥) to Oracle's child processes.
Application Architecture
• I/O Multiplexing Model
  – Instead of each socket communicating through its own socket I/O, a single socket I/O path is shared: sockets are registered in the file descriptor table, and the table's I/O is monitored to handle multiple connections
  – select / poll

Flow: on a connection request, the server assigns a file descriptor; while data is sent and received, it watches the file descriptors; on disconnect, it releases the file descriptor.
Application Architecture
• I/O Multiplexing Model
  – Drawback
    • select / poll is used for I/O multiplexing, but to find which file descriptor raised an event, the whole wide file descriptor array must be checked one by one in a loop

Diagram: every entry of the file descriptor table has to be examined, even though only the registered descriptors matter.
Application Architecture
• Event-based I/O Model through Real-time Signals
  – Event-based socket handling
    • UNIX/Linux: POSIX Real-time Signals, epoll
    • Windows, AIX, iSeries OS: IOCP
    • FreeBSD: kqueue (kernel queue)
Application Architecture
• Event-based I/O Model through Real-time Signals
  – Real-time Signals
    • Address two weaknesses of classic signals: they are not queued, and as a result they carry no information
    • Real-time signals are queued, and events can be stored up to the queue's capacity, so signal loss is avoided
    • A real-time signal can also deliver information such as the descriptor of the socket that raised it, so extra context can be attached
    • There is no need to scan the descriptor array of the file descriptor table as with select / poll

Diagram: sockets 1–3 deliver SIGRTMIN+1, SIGRTMIN+2, and SIGRTMIN+3 to threads 1–3; real-time signals are used together with a thread pool.
Application Architecture

• epoll
  – epoll: event poll
  – Roughly 10%–20% better performance than Real-time Signals
  – Supported on HP-UX and Red Hat; not supported on AIX
  – Because descriptors are managed inside the event poll, a read/write event returns the related information (such as the descriptor) directly, so there is no need to check in a loop as with poll

Diagram: sockets 1–3 register their file descriptors with the event poll.
Application Architecture
• epoll
  – httpd test results (charts): dphttpd symmetric multiprocessor result, dphttpd uniprocessor result
Application Architecture
• epoll
  – Pipetest results (charts): Pipetest symmetric multiprocessor result, Pipetest uniprocessor result
Application Architecture
• epoll
  – Dead connection test results (charts): 128-byte context and 1024-byte context dead-connection test results
Application Architecture
• IOCP (I/O Completion Ports)
  – IOCP on iSeries
    • Supported since the AS/400; the i line began in 1988 as the AS/400 and evolved through AS/400, OS/400, i5/OS, and i6/OS
    • AS/400 QMU 5.0.1.02 introduces asynchronous I/O completion ports (IOCP)
  – IOCP on Windows NT
    • Supported since Winsock2 on Windows NT
  – IOCP on AIX
    • I/O completion port support was first introduced in AIX 4.3 by APAR IY06351. An I/O completion port was originally a Windows NT scheduling construct that has since been implemented in other OSes. Domino uses these constructs to improve the scalability of the server. It allows one thread to handle multiple session requests, so that a Notes client session is no longer bound to a single thread for its duration. The completion port is tied directly to a device handle and any network I/O requests that are made to that handle.
Application Architecture
• Parallel Programming
  – Fundamentals of Parallel Programming
    • Multi-Process / Multi-Thread
    • Asynchronous Procedure Calls
    • Signal, Event
    • Queuing Asynchronous Procedure Calls
    • IOCP

Ex) File Finder Agent
Application Architecture
• Parallel Programming – OpenMP (Open Multi-Processing)
  – An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
  – Comprised of three primary API components
    • Compiler Directives
    • Runtime Library Routines
    • Environment Variables
  – Portable
  – Standardized
Application Architecture
• Parallel Programming - MPI
  – MPI (Message Passing Interface)
    • A standard data communication library for message-passing parallel programming
    • References
      – http://www.mcs.anl.gov/mpi/index.html
      – http://www.mpi-forum.org/docs/docs.html
  – MPI goals
    • Portability
    • Efficiency
    • Functionality
Application Architecture
• Parallel Programming
  – MPI basic concepts
    • Work is assigned per process
    • Processor : Process = 1:1 or 1:N
    • Message = data + envelope
      – Which process is sending?
      – Where is the data that is being sent?
      – What data is being sent?
      – How much is being sent?
      – Which process receives it?
      – Where will it be stored?
      – How much should the receiver be prepared to accept?
    • Tag
      – Used for message matching and classification
      – Allows messages to be processed in arrival order
      – Wildcards can be used
    • Communicator
      – The set of processes allowed to communicate with one another
Application Architecture
• Parallel Programming
  – MPI basic concepts
    • Process Rank
      – An identifier distinguishing the processes within the same communicator
    • Point-to-Point Communication
      – Communication between two processes
      – One sending process paired with one receiving process
    • Collective Communication
      – Several processes participate at once
      – 1:N, N:1, and N:N patterns are possible
      – Replaces multiple point-to-point communications with a single collective communication
        » Less error-prone, and faster thanks to optimization
Application Architecture
• Java– Development and execution of Java applications
Application Architecture
• Java
  – How to use the system efficiently from Java applications
    • NIO (New I/O)
    • NIO pollset
    • Let the garbage collector collect automatically
    • Unless there is a specific reason not to, keep the JRE updated to the latest release
    • Keep source code current during development (replace APIs marked Deprecated with alternatives where possible)
    • If you use a framework, keep the framework up to date

Some of these always bring improvement; others may not, depending on the JRE or framework.
Application Architecture
• Java
  – pollset
    • Java source code (selector-based NIO, and the native poll() call it maps to)

      DatagramChannel channel = DatagramChannel.open();
      channel.configureBlocking(false);
      Selector selector = Selector.open();
      channel.register(selector, SelectionKey.OP_READ);

      int poll(struct pollfd fds[], nfds_t nfds, int timeout);

    • Native pollset interface C source code

      pollset_t ps = pollset_create(int maxfd);
      int rc = pollset_destroy(pollset_t ps);
      int rc = pollset_ctl(pollset_t ps, struct poll_ctl *pollctl_array, int array_length);
      int nfound = pollset_poll(pollset_t ps, struct pollfd *polldata_array, int array_length, int timeout);
Application Architecture
• Java– pollset
• Traditional poll method
Application Architecture
• Java– pollset
• pollset method
Application Architecture
• Java– pollset
• pollcache internal– pollcache control block
Application Architecture
• Java– pollset
• pollset() – bulky update
Application Architecture
• Java– pollset
• Throughput performance of the two drivers (with poll() and with pollset())
  – The pollset driver performs 13.3% better than the poll driver
Application Architecture
• Java– pollset
• Time spent on CPU
AIX I/O Model
• select / poll
• pollset
• event
• Real-time Signal
• AIO
• IOCP
AIX IOCP
• IOCP
– I/O completion port support was first introduced in AIX 4.3 by APAR IY06351. An I/O completion port was originally a Windows NT scheduling construct that has since been implemented in other OS's.
AIX IOCP
• IOCP– Synchronous I/O versus asynchronous I/O
AIX IOCP
• IOCP– IOCP Operation
AIX IOCP
• IOCP– CreateIoCompletionPort Function
< IOCP on AIX >
#include <iocp.h>
int CreateIoCompletionPort (FileDescriptor, CompletionPort, CompletionKey, ConcurrentThreads)
HANDLE FileDescriptor, CompletionPort;
DWORD CompletionKey, ConcurrentThreads;
< IOCP on Windows >
HANDLE CreateIoCompletionPort (
    HANDLE FileHandle,                 // handle to file (socket)
    HANDLE ExistingCompletionPort,     // handle to I/O completion port
    ULONG_PTR CompletionKey,           // completion key
    DWORD NumberOfConcurrentThreads    // number of threads to execute concurrently
);
AIX IOCP
• IOCP– How to configure IOCP on AIX
• fileset : bos.iocp.rte
$ lslpp -l bos.iocp.rte
The output from the lslpp command should be similar to the following:
  Fileset                      Level  State      Description
  ----------------------------------------------------------------------------
  Path: /usr/lib/objrepos
  bos.iocp.rte               5.3.9.0  APPLIED    I/O Completion Ports API
  Path: /etc/objrepos
  bos.iocp.rte              5.3.0.50  COMMITTED  I/O Completion Ports API

office2@root/>lsdev -Cc iocp
iocp0 Available I/O Completion Ports

office2@root/>lsattr -El iocp0
autoconfig available STATE to be configured at system restart True
AIX IOCP
• IOCP API
  – CreateCompletionPort
  – GetMultipleCompletionStatus
  – GetQueuedCompletionStatus
  – PostQueuedCompletionStatus
  – ReadFile
  – WriteFile
iSeries IOCP
• IOCP API
  – QsoStartAccept
  – QsoCreateIOCompletionPort
  – QsoDestroyIOCompletionPort
  – QsoPostIOCompletion
  – QsoStartRecv
  – QsoStartSend
  – QsoCancelOperation
  – QsoWaitForIOCompletion
Windows IOCP
• IOCP API
  – CreateIoCompletionPort
  – GetQueuedCompletionStatus
  – GetQueuedCompletionStatusEx
  – PostQueuedCompletionStatus
  – ReadFileEx
  – WriteFileEx
  – Kernel Functions
    • NtCreateIoCompletion, NtRemoveIoCompletion
    • KeInitializeQueue, KeRemoveQueue
    • KeInsertQueue
    • KeWaitForSingleObject
    • KeDelayExecutionThread
    • KiActivateWaiterQueue
    • KiUnwaitThread
    • NtSetIoCompletion
Q & A