All-Flash Ceph Configuration and Optimization
Feb. 18, 2016
SDS Tech. Lab, Corporate R&D Center
SK Telecom
OpenStack Days in Korea
Why are we focusing on all-flash Ceph?
Tech. Trends of Storage Systems
Storage systems have been evolving from hybrid scale-up systems, through hybrid scale-out and all-flash scale-up systems, toward all-flash scale-out systems, increasing effective capacity and raising performance.
Requirements for an All-IT Network/Infra storage system: Scalability, Availability, Performance
What is Ceph?
http://docs.ceph.com/docs/master/_images/stack.png
[Stack diagram: object (RADOSGW), virtual disk (RBD), files & dirs (CephFS), and native object (librados) interfaces on top of RADOS, used by apps, hosts/VMs, and clients]
• Ceph is a unified, distributed, massively scalable open source storage solution providing object, block, and file storage
• Mostly LGPL open source project
• Failure is normal
• Self-managing
• Scales out on commodity hardware
• Everything runs in software
Ceph Architecture
[Architecture diagram] Three monitors maintain and distribute the cluster maps. Clients (KVM via librbd, applications via krbd or librados) obtain the cluster maps and then perform direct IO to the OSDs over the service network; the OSDs replicate and communicate with one another over the storage network. No centralized component sits on the data path between clients and the Ceph storage system.
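The direct client-to-OSD data path is visible from the client API. Below is a minimal librados sketch in Python (requires the python-rados bindings; the config path and pool name are assumptions for illustration):

    import rados

    # Monitor addresses and credentials come from ceph.conf; the client
    # first fetches the cluster maps from a monitor.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # IO goes straight to the OSDs computed from the cluster maps; the
    # monitors are not on the data path.
    ioctx = cluster.open_ioctx('rbd')            # pool name is an assumption
    ioctx.write_full('demo-object', b'payload')  # write directly to the primary OSD
    print(ioctx.read('demo-object'))
    ioctx.close()
    cluster.shutdown()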
Ceph Operation: Ceph Block Device
[Diagram] An application uses a Ceph block device through librbd/librados, which map block IO onto RADOS objects and send it to the OSD service.
• Data placement: CRUSH algorithm
• A Ceph block device is a sequence of fixed-size objects (default: 4MB); e.g., a 1GB block image = 256 objects
• Each object is mapped to a placement group (PG) by hashing its name; CRUSH then maps each PG (PG#0 .. PG#3) to OSDs
• Each OSD runs FileStore on an XFS-formatted disk or RAID group: journal writes use O_DIRECT and O_DSYNC, data writes use buffered I/O
• Writes are replicated synchronously from the primary OSD (e.g., OSD #0) to the secondary OSD (e.g., OSD #1)
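To make the object layout concrete, here is a sketch of the striping arithmetic and the object-to-PG hashing step. Note that Ceph itself uses the rjenkins hash and stable_mod, not the simplified md5-and-modulo stand-in used here:

    import hashlib

    OBJECT_SIZE = 4 * 2**20            # default RBD object size: 4MB
    IMAGE_SIZE = 1 * 2**30             # 1GB block image
    print(IMAGE_SIZE // OBJECT_SIZE)   # -> 256 objects

    def object_to_pg(object_name: str, pg_num: int) -> int:
        # Illustrative stand-in: Ceph hashes the object name with rjenkins
        # and uses stable_mod to pick a PG; CRUSH then maps the PG to OSDs.
        h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
        return h % pg_num

    print(object_to_pg('rbd_data.1234.0000000000000003', 128))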
Ceph OSD Node Configuration
Journal / data disk configuration
• Common combinations (journal / data): SSD / HDD, no external journal disk / SSD, PCIe SSD / SATA SSD, NVRAM / SSD
[Chart: 4KB random write IOPS and latency (ms) by journal type, SSD vs. NVRAM]
OSDs per node
• OSD: 1 OSD per (1 disk or 1 RAID group)
• Ceph OSD daemons are CPU-intensive processes
[Chart: 4KB random write IOPS and latency (ms) with 3, 4, 6, 8, and 12 OSDs per node]
Ceph on All-Flash Performance Issues
[Chart: 4KB random write IOPS and latency vs. thread count]
Thread count:  4      8       16      32      64      128
IOPS:          6,945  11,585  14,196  15,885  15,298  16,603
Latency (ms):  2.9    3.5     5.7     10.1    21.9    39.7

[Chart: 4KB random read IOPS and latency vs. thread count]
Thread count:  4      8      16     32      64      128
IOPS:          3,716  4,910  7,243  13,694  80,062  102,090
Latency (ms):  5.4    8.1    11.0   11.7    5.9     10.9

Issue: low throughput and high latency
• SSD spec: 4KB random read up to 95K IOPS; 4KB random write up to 85K IOPS (sustained 14K IOPS); latency < 1ms
• Ideal throughput: 4KB random read 95K * 10 ea * 4 nodes = 3,800K IOPS; 4KB random write 14K * 10 ea * 4 nodes / 2 (replication) = 280K IOPS
• Sustained performance: 2x write (80% of usable space); 5 clients use krbd

Measured vs. ideal:
4KB random read:  102K IOPS measured / 3,800K IOPS ideal
4KB random write: 17K IOPS measured / 280K IOPS ideal
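The ideal figures follow from simple arithmetic over the per-SSD spec; a quick check in Python:

    # Per-SSD spec from the slide; 10 SSDs per node, 4 nodes, 2x replication.
    read_per_ssd, write_per_ssd = 95_000, 14_000
    ssds, nodes, replicas = 10, 4, 2

    ideal_read = read_per_ssd * ssds * nodes                 # reads hit one replica
    ideal_write = write_per_ssd * ssds * nodes // replicas   # each write lands twice
    print(f'{ideal_read:,} read IOPS, {ideal_write:,} write IOPS')
    # -> 3,800,000 read IOPS, 280,000 write IOPS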
Ceph IO Flow in OSD
Inside the OSD, the Messenger hands requests to the PG layer; a PGBackend (ReplicatedBackend or ECBackend) applies them through the ObjectStore interface. ObjectStore implementations include JournalingObjectStore (with FileStore beneath it), KeyValueStore, MemStore, and BlueStore.
1. Journal: libaio (O_DIRECT && O_DSYNC) → Committed
2. Data: buffered IO, syncfs() later → Applied
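The two IO modes can be reproduced with plain syscalls. A minimal Linux-only sketch, assuming local file names and using synchronous writes in place of FileStore's libaio:

    import mmap
    import os

    # Journal path: O_DIRECT bypasses the page cache and O_DSYNC makes the
    # write durable before returning, so the op can be acked "committed".
    buf = mmap.mmap(-1, 4096)     # O_DIRECT needs an aligned buffer and length
    buf.write(b'\xab' * 4096)
    jfd = os.open('journal.bin', os.O_WRONLY | os.O_CREAT | os.O_DIRECT | os.O_DSYNC)
    os.write(jfd, buf)            # durable on return
    os.close(jfd)

    # Data path: buffered write now, durability later; FileStore calls
    # syncfs() periodically (os.sync() stands in for it here).
    dfd = os.open('object.bin', os.O_WRONLY | os.O_CREAT)
    os.write(dfd, b'\xab' * 4096) # lands in the page cache first
    os.close(dfd)
    os.sync()                     # flush, after which the journal can be trimmed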
Ceph Write IO Flow: Receiving Request
[Flow diagram, reconstructed as steps]
1. The client sends a write request over the public network.
2. The Messenger on the primary OSD receives the write request.
3. The operation is queued to the Operation WQ.
4. An operation thread takes the PG lock and performs the operation.
5. The ReplicatedBackend sends the replication operation to the secondary OSD over the cluster network, where its Messenger queues it to that OSD's Operation WQ for its own operation threads.
6. The transaction is enqueued to FileStore and the PG lock is released.
On both the primary and the secondary OSD, the PG lock is held for the whole operation and released afterwards.
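The PG lock serializes everything that touches one PG, which matters for the optimization section below. A toy sketch (not Ceph code) of an op queue with per-PG locks:

    import queue
    import threading

    NUM_PGS = 4
    pg_locks = [threading.Lock() for _ in range(NUM_PGS)]
    op_wq = queue.Queue()

    def worker():
        while True:
            item = op_wq.get()
            if item is None:
                return
            pg, op = item
            with pg_locks[pg]:   # held across the whole operation
                op()

    workers = [threading.Thread(target=worker) for _ in range(2)]
    for w in workers:
        w.start()

    op_wq.put((0, lambda: print('op A on pg 0')))
    op_wq.put((0, lambda: print('op B on pg 0')))  # waits on pg 0's lock behind op A
    op_wq.put((1, lambda: print('op C on pg 1')))  # independent PG, runs in parallel
    for _ in workers:
        op_wq.put(None)
    for w in workers:
        w.join()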
Ceph Write IO Flow: in File Store
[Flow diagram: File Journal (writer thread, write finisher thread) and File Store data path (operation threads, finisher threads), reconstructed as steps]
1. The transaction is queued to the FileJournal writeq.
2. The writer thread processes the journal transactions.
3. The journal entry is written to the journal disk with AIO.
4. The AIO completes, waking the write finisher thread.
5. The op is queued to the FileStore Operation WQ.
6. The op is queued to a finisher thread, which signals Committed.
7. An operation thread takes the PG lock and writes the data to the data disk (buffered IO).
8. The op is queued to a finisher thread, which signals Applied; once both journal and data are complete, the RepOp reply is sent to the primary if this is a secondary OSD.
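The journal/data split yields two completion signals per transaction. A condensed sketch of that two-phase model, with hypothetical names standing in for FileStore's internal callbacks:

    class Transaction:
        def __init__(self, data, on_commit, on_apply):
            self.data = data
            self.on_commit = on_commit   # journal durable -> "committed"
            self.on_apply = on_apply     # visible in the fs -> "applied"

    def filestore_queue_transaction(txn, journal, data_store):
        journal.append(txn.data)     # steps 2-3: writer thread hits journal disk
        txn.on_commit()              # steps 4-6: AIO done -> finisher -> committed
        data_store.append(txn.data)  # step 7: buffered write to the data disk
        txn.on_apply()               # step 8: finisher -> applied

    journal, data_store = [], []
    txn = Transaction(b'4KB write',
                      on_commit=lambda: print('committed'),
                      on_apply=lambda: print('applied'))
    filestore_queue_transaction(txn, journal, data_store)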
Optimization

Item: PG Lock
Issue: more than 30% of total latency is spent acquiring the PG lock
• OP-processing worker threads get blocked, delaying unrelated OPs
• The PG lock protects a large critical section
• Delayed ACK processing on the secondary OSD increases IO latency

Item: Ceph & system tuning
Issue: results fluctuate heavily between performance runs
• Ceph configuration variables: changing them individually is ineffective; an optimal combination is needed
• The memory allocator's CPU usage is high
• TCP/IP Nagle algorithm

Item: Log
Issue: performance changes greatly depending on whether logging is disabled
• Time is consumed by logging in the OSD IO processing path

Item: Transaction
Issue: transaction processing strongly affects performance
• Inefficient transaction processing: unnecessary operations, lock contention
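On the Nagle item: small messages can sit in the TCP stack waiting to be coalesced, which hurts small-block latency. Ceph exposes this through its ms tcp nodelay option; the equivalent raw socket call is:

    import socket

    # Disable Nagle so small writes are sent immediately instead of being
    # coalesced; this trades a little bandwidth for lower latency.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)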
VM Performance: Test Environment
Service network (10GbE), storage network (10GbE)

Physical Client (x 5)
Vendor / Model: DELL R720XD
Processor: Intel Xeon E5-2670v3 @ 2.60GHz x 2 (10 core)
Memory: 128GB
OS: CentOS 7.0

OSD Node / Monitor (x 4)
Vendor / Model: DELL R630
Processor: Intel Xeon E5-2690v3 @ 2.60GHz x 2 (12 core)
Memory: 128GB
NIC: 10GbE
OS: CentOS 7.0
Journal: RAMDISK
SSD: SK Hynix 480GB SSD x 10 per OSD node
RAID: RAID 0 in groups of 3, 3, 2, 2 SSDs (4 RAID groups), i.e. 4 devices & 4 daemons per OSD node

Switch (x 2)
Vendor / Model: Cisco Nexus 5548UP 10G

Ceph
Version: SKT Ceph and Community (0.94.4)

VM (up to 4 per physical client)
Guest OS spec: 2 cores, 4GB memory, librbd

FIO test configuration
Run time: 300s, ramp time: 10s
Threads: 8, queue depth: 8
Sustained performance: 2x write (80% of usable space)
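A fio job file consistent with the stated parameters might look as follows; the IO engine, direct flag, and target device are assumptions, not taken from the slide:

    [global]
    ; assumed engine and direct IO inside the VM guest
    ioengine=libaio
    direct=1
    time_based
    runtime=300
    ramp_time=10
    numjobs=8
    iodepth=8

    [randwrite-4k]
    rw=randwrite
    bs=4k
    ; hypothetical attached RBD-backed volume
    filename=/dev/vdb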
VM Performance Comparison: Random Workload
[Chart: random workload, SKT Ceph vs. Community, IOPS (KIOPS) and latency (ms)]
4KB random write:  SKT Ceph 71 KIOPS @ 3.4 ms / Community 3 KIOPS @ 5.7 ms
32KB random write: SKT Ceph 43 KIOPS @ 2.7 ms / Community 3 KIOPS @ 5.5 ms
4KB random read:   SKT Ceph 185 KIOPS @ 3.5 ms / Community 114 KIOPS @ 3.4 ms
32KB random read:  SKT Ceph 118 KIOPS @ 2.0 ms / Community 71 KIOPS @ 2.5 ms
VM Performance Comparison: Sequential Workload
[Chart: sequential workload, SKT Ceph vs. Community, bandwidth (MB/s) and latency (ms)]
1MB sequential write: SKT Ceph 2,669 MB/s @ 59.7 ms / Community 2,729 MB/s @ 28.3 ms
4MB sequential write: SKT Ceph 2,768 MB/s @ 172.4 ms / Community 2,948 MB/s @ 425.2 ms
1MB sequential read:  SKT Ceph 4,287 MB/s @ 73.2 ms / Community 4,281 MB/s @ 36.7 ms
4MB sequential read:  SKT Ceph 4,281 MB/s @ 293.6 ms / Community 4,296 MB/s @ 292.7 ms
SKT AF-Ceph
AFC-S: 4 data nodes + 1 management node (based on commodity servers and SATA SSDs)
Monitor node (management server); each data node (OSD node) uses an NVRAM journal and 10 SATA SSDs as the data store.

System configuration:
Configuration: 4 data nodes + 1 monitor node
Rack space: 5U
SSD: 40 SATA SSDs (in 4U)
NVRAM: 8GB NVRAM
Capacity: total 40TB / usable 20TB (with 1TB SSDs); total 80TB / usable 40TB (with 2TB SSDs)

Node H/W:
CPU: Intel Xeon E5 2690v3, 2-socket
RAM: 128GB (DDR3 1866MHz)
Network: 10GbE x 2 for service & storage

AFC-N: 2U MicroServer (4 data nodes) + 1U NVMe all-flash JBOF
NV-Array (all-flash JBOF) with NV-Drives (NVMe SSDs); E5 2-socket server (4 nodes in 2U)
• High performance (PCIe 3.0)
• High density (24 x 2.5" NVMe SSDs: up to 96TB)
• Scheduled for Q4 2016
SKT AF-Ceph
Management features:
• Real-time monitoring
• Multi dashboard
• Rule-based alarm
• Drag & drop admin
• REST API
• Real-time graph
• Graph merge
• Drag & zooming
• Auto configuration
• Cluster management
• RBD management
• Object storage management