All-Flash Ceph Configuration and Optimization
Feb. 18, 2016
SDS Tech. Lab, Corporate R&D Center
SK Telecom
OpenStack Days in Korea
Why are we focusing on all-flash Ceph?
Tech. Trends of Storage Systems
Storage systems have been evolving from hybrid scale-up systems, through hybrid scale-out and all-flash scale-up systems, toward all-flash scale-out systems, increasing effective capacity and raising performance.
Requirements for an All-IT Network/Infra storage system: Scalability, Availability, Performance
What is Ceph?
http://docs.ceph.com/docs/master/_images/stack.png
[Stack diagram: object (RADOSGW), virtual disk (RBD), files & dirs (CephFS), and native object (librados) interfaces on top of RADOS, used by apps, hosts/VMs, and clients]
• Ceph is a unified, distributed, massively scalable open source storage solution providing object, block, and file storage
• Mostly LGPL open source project
• Failure is normal
• Self-managing
• Scales out on commodity hardware
• Everything runs in software
Ceph Architecture
[Architecture diagram] Three monitors maintain and distribute the cluster maps. Clients (KVM via librbd, applications via krbd or librados) obtain the cluster maps and then perform direct IO to the OSDs over the service network; the OSDs replicate and communicate with one another over the storage network. No centralized component sits on the data path between clients and the Ceph storage system.
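The direct client-to-OSD data path is visible from the client API. Below is a minimal librados sketch in Python (requires the python-rados bindings; the config path and pool name are assumptions for illustration):

    import rados

    # Monitor addresses and credentials come from ceph.conf; the client
    # first fetches the cluster maps from a monitor.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # IO goes straight to the OSDs computed from the cluster maps; the
    # monitors are not on the data path.
    ioctx = cluster.open_ioctx('rbd')            # pool name is an assumption
    ioctx.write_full('demo-object', b'payload')  # write directly to the primary OSD
    print(ioctx.read('demo-object'))
    ioctx.close()
    cluster.shutdown()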
Ceph Operation: Ceph Block Device
[Diagram] An application uses a Ceph block device through librbd/librados, which map block IO onto RADOS objects and send it to the OSD service.
• Data placement: CRUSH algorithm
• A Ceph block device is a sequence of fixed-size objects (default: 4MB); e.g., a 1GB block image = 256 objects
• Each object is mapped to a placement group (PG) by hashing its name; CRUSH then maps each PG (PG#0 .. PG#3) to OSDs
• Each OSD runs FileStore on an XFS-formatted disk or RAID group: journal writes use O_DIRECT and O_DSYNC, data writes use buffered I/O
• Writes are replicated synchronously from the primary OSD (e.g., OSD #0) to the secondary OSD (e.g., OSD #1)
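To make the object layout concrete, here is a sketch of the striping arithmetic and the object-to-PG hashing step. Note that Ceph itself uses the rjenkins hash and stable_mod, not the simplified md5-and-modulo stand-in used here:

    import hashlib

    OBJECT_SIZE = 4 * 2**20            # default RBD object size: 4MB
    IMAGE_SIZE = 1 * 2**30             # 1GB block image
    print(IMAGE_SIZE // OBJECT_SIZE)   # -> 256 objects

    def object_to_pg(object_name: str, pg_num: int) -> int:
        # Illustrative stand-in: Ceph hashes the object name with rjenkins
        # and uses stable_mod to pick a PG; CRUSH then maps the PG to OSDs.
        h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
        return h % pg_num

    print(object_to_pg('rbd_data.1234.0000000000000003', 128))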
Ceph OSD Node Configuration
Journal / data disk configuration
• Common combinations (journal / data): SSD / HDD, no external journal disk / SSD, PCIe SSD / SATA SSD, NVRAM / SSD
[Chart: 4KB random write IOPS and latency (ms) by journal type, SSD vs. NVRAM]
OSDs per node
• OSD: 1 OSD per (1 disk or 1 RAID group)
• Ceph OSD daemons are CPU-intensive processes
[Chart: 4KB random write IOPS and latency (ms) with 3, 4, 6, 8, and 12 OSDs per node]
Ceph on All-Flash Performance Issues
[Chart: 4KB random write IOPS and latency vs. thread count]
Thread count:  4      8       16      32      64      128
IOPS:          6,945  11,585  14,196  15,885  15,298  16,603
Latency (ms):  2.9    3.5     5.7     10.1    21.9    39.7

[Chart: 4KB random read IOPS and latency vs. thread count]
Thread count:  4      8      16     32      64      128
IOPS:          3,716  4,910  7,243  13,694  80,062  102,090
Latency (ms):  5.4    8.1    11.0   11.7    5.9     10.9

Issue: low throughput and high latency
• SSD spec: 4KB random read up to 95K IOPS; 4KB random write up to 85K IOPS (sustained 14K IOPS); latency < 1ms
• Ideal throughput: 4KB random read 95K * 10 ea * 4 nodes = 3,800K IOPS; 4KB random write 14K * 10 ea * 4 nodes / 2 (replication) = 280K IOPS
• Sustained performance: 2x write (80% of usable space); 5 clients use krbd

Measured vs. ideal:
4KB random read:  102K IOPS measured / 3,800K IOPS ideal
4KB random write: 17K IOPS measured / 280K IOPS ideal
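The ideal figures follow from simple arithmetic over the per-SSD spec; a quick check in Python:

    # Per-SSD spec from the slide; 10 SSDs per node, 4 nodes, 2x replication.
    read_per_ssd, write_per_ssd = 95_000, 14_000
    ssds, nodes, replicas = 10, 4, 2

    ideal_read = read_per_ssd * ssds * nodes                 # reads hit one replica
    ideal_write = write_per_ssd * ssds * nodes // replicas   # each write lands twice
    print(f'{ideal_read:,} read IOPS, {ideal_write:,} write IOPS')
    # -> 3,800,000 read IOPS, 280,000 write IOPS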
Ceph IO Flow in OSD
Inside the OSD, the Messenger hands requests to the PG layer; a PGBackend (ReplicatedBackend or ECBackend) applies them through the ObjectStore interface. ObjectStore implementations include JournalingObjectStore (with FileStore beneath it), KeyValueStore, MemStore, and BlueStore.
1. Journal: libaio (O_DIRECT && O_DSYNC) → Committed
2. Data: buffered IO, syncfs() later → Applied
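The two IO modes can be reproduced with plain syscalls. A minimal Linux-only sketch, assuming local file names and using synchronous writes in place of FileStore's libaio:

    import mmap
    import os

    # Journal path: O_DIRECT bypasses the page cache and O_DSYNC makes the
    # write durable before returning, so the op can be acked "committed".
    buf = mmap.mmap(-1, 4096)     # O_DIRECT needs an aligned buffer and length
    buf.write(b'\xab' * 4096)
    jfd = os.open('journal.bin', os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC)
    os.write(jfd, buf)            # durable on return
    os.close(jfd)

    # Data path: buffered write now, durability later; FileStore calls
    # syncfs() periodically (os.sync() stands in for it here).
    dfd = os.open('object.bin', os.O_WRONLY | os.O_CREAT)
    os.write(dfd, b'\xab' * 4096) # lands in the page cache first
    os.close(dfd)
    os.sync()                     # flush, after which the journal can be trimmed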
Ceph Write IO Flow: Receiving Request
[Flow diagram, reconstructed as steps]
1. The client sends a write request over the public network.
2. The Messenger on the primary OSD receives the write request.
3. The operation is queued to the Operation WQ.
4. An operation thread takes the PG lock and performs the operation.
5. The ReplicatedBackend sends the replication operation to the secondary OSD over the cluster network, where its Messenger queues it to that OSD's Operation WQ for its own operation threads.
6. The transaction is enqueued to FileStore and the PG lock is released.
On both the primary and the secondary OSD, the PG lock is held for the whole operation and released afterwards.
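The PG lock serializes everything that touches one PG, which matters for the optimization section below. A toy sketch (not Ceph code) of an op queue with per-PG locks:

    import queue
    import threading

    NUM_PGS = 4
    pg_locks = [threading.Lock() for _ in range(NUM_PGS)]
    op_wq = queue.Queue()

    def worker():
        while True:
            item = op_wq.get()
            if item is None:
                return
            pg, op = item
            with pg_locks[pg]:   # held across the whole operation
                op()

    workers = [threading.Thread(target=worker) for _ in range(2)]
    for w in workers:
        w.start()

    op_wq.put((0, lambda: print('op A on pg 0')))
    op_wq.put((0, lambda: print('op B on pg 0')))  # waits on pg 0's lock behind op A
    op_wq.put((1, lambda: print('op C on pg 1')))  # independent PG, runs in parallel
    for _ in workers:
        op_wq.put(None)
    for w in workers:
        w.join()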
Ceph Write IO Flow: in File Store
[Flow diagram: File Journal (writer thread, write finisher thread) and File Store data path (operation threads, finisher threads), reconstructed as steps]
1. The transaction is queued to the FileJournal writeq.
2. The writer thread processes the journal transactions.
3. The journal entry is written to the journal disk with AIO.
4. The AIO completes, waking the write finisher thread.
5. The op is queued to the FileStore Operation WQ.
6. The op is queued to a finisher thread, which signals Committed.
7. An operation thread takes the PG lock and writes the data to the data disk (buffered IO).
8. The op is queued to a finisher thread, which signals Applied; once both journal and data are complete, the RepOp reply is sent to the primary if this is a secondary OSD.
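The journal/data split yields two completion signals per transaction. A condensed sketch of that two-phase model, with hypothetical names standing in for FileStore's internal callbacks:

    class Transaction:
        def __init__(self, data, on_commit, on_apply):
            self.data = data
            self.on_commit = on_commit   # journal durable -> "committed"
            self.on_apply = on_apply     # visible in the fs -> "applied"

    def filestore_queue_transaction(txn, journal, data_store):
        journal.append(txn.data)     # steps 2-3: writer thread hits journal disk
        txn.on_commit()              # steps 4-6: AIO done -> finisher -> committed
        data_store.append(txn.data)  # step 7: buffered write to the data disk
        txn.on_apply()               # step 8: finisher -> applied

    journal, data_store = [], []
    txn = Transaction(b'4KB write',
                      on_commit=lambda: print('committed'),
                      on_apply=lambda: print('applied'))
    filestore_queue_transaction(txn, journal, data_store)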
Optimization

Item: PG Lock
Issue: more than 30% of total latency is spent acquiring the PG lock
• OP-processing worker threads get blocked, delaying unrelated OPs
• The PG lock protects a large critical section
• Delayed ACK processing on the secondary OSD increases IO latency

Item: Ceph & system tuning
Issue: results fluctuate heavily between performance runs
• Ceph configuration variables: changing them individually is ineffective; an optimal combination is needed
• The memory allocator's CPU usage is high
• TCP/IP Nagle algorithm

Item: Log
Issue: performance changes greatly depending on whether logging is disabled
• Time is consumed by logging in the OSD IO processing path

Item: Transaction
Issue: transaction processing strongly affects performance
• Inefficient transaction processing: unnecessary operations, lock contention
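On the Nagle item: small messages can sit in the TCP stack waiting to be coalesced, which hurts small-block latency. Ceph exposes this through its ms tcp nodelay option; the equivalent raw socket call is:

    import socket

    # Disable Nagle so small writes are sent immediately instead of being
    # coalesced; this trades a little bandwidth for lower latency.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)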
VM Performance: Test Environment
Service network (10GbE), storage network (10GbE)

Physical Client (x 5)
Vendor / Model: DELL R720XD
Processor: Intel Xeon E5-2670v3 @ 2.60GHz x 2 (10 core)
Memory: 128GB
OS: CentOS 7.0

OSD Node / Monitor (x 4)
Vendor / Model: DELL R630
Processor: Intel Xeon E5-2690v3 @ 2.60GHz x 2 (12 core)
Memory: 128GB
NIC: 10GbE
OS: CentOS 7.0
Journal: RAMDISK
SSD: SK Hynix 480GB SSD x 10 per OSD node
RAID: RAID 0 in groups of 3, 3, 2, 2 SSDs (4 RAID groups), i.e. 4 devices & 4 daemons per OSD node

Switch (x 2)
Vendor / Model: Cisco Nexus 5548UP 10G

Ceph
Version: SKT Ceph and Community (0.94.4)

VM (up to 4 per physical client)
Guest OS spec: 2 cores, 4GB memory, librbd

FIO test configuration
Run time: 300s, ramp time: 10s
Threads: 8, queue depth: 8
Sustained performance: 2x write (80% of usable space)
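A fio job file consistent with the stated parameters might look as follows; the IO engine, direct flag, and target device are assumptions, not taken from the slide:

    [global]
    ; assumed engine and direct IO inside the VM guest
    ioengine=libaio
    direct=1
    time_based
    runtime=300
    ramp_time=10
    numjobs=8
    iodepth=8

    [randwrite-4k]
    rw=randwrite
    bs=4k
    ; hypothetical attached RBD-backed volume
    filename=/dev/vdb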
VM Performance Comparison: Random Workload
[Chart: random workload, SKT Ceph vs. Community, IOPS (KIOPS) and latency (ms)]
4KB random write:  SKT Ceph 71 KIOPS @ 3.4 ms / Community 3 KIOPS @ 5.7 ms
32KB random write: SKT Ceph 43 KIOPS @ 2.7 ms / Community 3 KIOPS @ 5.5 ms
4KB random read:   SKT Ceph 185 KIOPS @ 3.5 ms / Community 114 KIOPS @ 3.4 ms
32KB random read:  SKT Ceph 118 KIOPS @ 2.0 ms / Community 71 KIOPS @ 2.5 ms
VM Performance Comparison: Sequential Workload
[Chart: sequential workload, SKT Ceph vs. Community, bandwidth (MB/s) and latency (ms)]
1MB sequential write: SKT Ceph 2,669 MB/s @ 59.7 ms / Community 2,729 MB/s @ 28.3 ms
4MB sequential write: SKT Ceph 2,768 MB/s @ 172.4 ms / Community 2,948 MB/s @ 425.2 ms
1MB sequential read:  SKT Ceph 4,287 MB/s @ 73.2 ms / Community 4,281 MB/s @ 36.7 ms
4MB sequential read:  SKT Ceph 4,281 MB/s @ 293.6 ms / Community 4,296 MB/s @ 292.7 ms
SKT AF-Ceph
AFC-S: 4 data nodes + 1 management node (based on commodity servers and SATA SSDs)
Monitor node (management server); each data node (OSD node) uses an NVRAM journal and 10 SATA SSDs as the data store.

System configuration:
Configuration: 4 data nodes + 1 monitor node
Rack space: 5U
SSD: 40 SATA SSDs (in 4U)
NVRAM: 8GB NVRAM
Capacity: total 40TB / usable 20TB (with 1TB SSDs); total 80TB / usable 40TB (with 2TB SSDs)

Node H/W:
CPU: Intel Xeon E5 2690v3, 2-socket
RAM: 128GB (DDR3 1866MHz)
Network: 10GbE x 2 for service & storage

AFC-N: 2U MicroServer (4 data nodes) + 1U NVMe all-flash JBOF
NV-Array (all-flash JBOF) with NV-Drives (NVMe SSDs); E5 2-socket server (4 nodes in 2U)
• High performance (PCIe 3.0)
• High density (24 x 2.5" NVMe SSDs: up to 96TB)
• Scheduled for Q4 2016
SKT AF-Ceph
Management features:
• Real-time monitoring
• Multi dashboard
• Rule-based alarm
• Drag & drop admin
• REST API
• Real-time graph
• Graph merge
• Drag & zooming
• Auto configuration
• Cluster management
• RBD management
• Object storage management