Ceph Introduction
果凍 (Jelly)
About Me
● Works at inwinstack (formerly In Win's cloud application R&D center)
● python, django, linux, openstack, docker
● kjellytw at gmail dot com
● http://www.blackwhite.tw/
Outline
● What is a storage system
● The evolution of storage systems
● An introduction to Ceph
● Using Ceph from Python
● A few Ceph commands
● Yahoo's Ceph cluster architecture
What Is a Storage System?
● Capacity is finite; data is endless
● When one disk is not enough, add another
● A storage system is a system that manages large amounts of data
Storage System
● Features:
○ Replication
○ High capacity
○ Consistency
● Optional features:
○ Over-allocation
○ Snapshots
○ Deduplication
Evolution of Storage Systems - 1: Each Host Uses Its Own Disks
[Diagram: a host with directly attached disks; disk controller → file system → directory layer]
[Diagram: a JBOD enclosure attached to a host; disk controller → file system → directory layer]
Evolution of Storage Systems - 2
[Diagram: a storage array (disk controller or SCSI controller) exposes LUNs to hosts over the FC protocol or iSCSI; the file system and directory layer stay on each host]
Evolution of Storage Systems - 3
[Diagram: NAS; the disks, controller, LUNs, and file system live inside the NAS, and hosts see the directory layer over the network]
What’s Next?
● Storage cluster:
○ capacity
○ performance
● Storage cluster examples:
○ IBRIX
○ PanFS
○ Ceph
Storage Cluster - 1
1. The client sends data to the controller.
2. The controller stores the data on the storage nodes.
[Diagram: Client → Controller → Storage nodes]
Storage Cluster - 2
1. The client asks the controller where the data is stored.
2. The client stores the data on the storage nodes directly.
[Diagram: Client ↔ Controller; Client → Storage nodes]
Storage Cluster - 3
Clientmonitor
Storage
1. Get the cluster information.
3. Cient store the data to stroa“ge directly.
2. Client compute the position where the data should put based on cluster information
Storage node
Ceph
● A clustered storage system
● Software-defined storage
○ Cost/performance tradeoff
○ Flexible interfaces
○ Different storage abstractions
Ceph
● A distributed object store and file system designed to provide excellent performance, reliability and scalability.
● Open source and freely-available, and it always will be.
● Object Storage (rados)
● Block Storage (rbd)
● File System (cephfs)
● Object Storage:
○ You get/put objects by key through the interface the object store provides.
○ Example: S3
● Block Storage:
○ Provides virtual disks that are just like real disks.
● File System:
○ Just like a NAS.
Ceph Motivating Principles
● Everything must scale horizontally
● No single point of failure
● Commodity hardware
● Self-manage whenever possible
Ceph Architectural Features
● Object locations are computed, not looked up.
○ The CRUSH algorithm
○ Ceph OSD Daemons use it to compute where replicas of objects should be stored (and for rebalancing).
○ Ceph clients use CRUSH to efficiently compute object locations themselves.
● No centralized interface.
● OSDs serve clients directly.
Ceph Components
RADOS
● A scalable, reliable storage service for petabyte-scale storage clusters
● Components:
○ ceph-mon
○ ceph-osd
○ librados
● Data hierarchy:
○ pool
○ object
RADOS
● ceph-mon:
○ Maintains the master copy of the cluster map.
○ Monitors do not serve stored objects to clients.
RADOS
● ceph-osd:
○ Stores objects on a local file system.
○ Serves objects over the network to clients directly.
○ Uses the CRUSH algorithm to determine where an object lives.
○ One per disk (or RAID group).
○ Peers with other OSDs.
○ Checks its own state and the state of other OSDs, and reports back to the monitors.
RADOS
● librados:
○ The client retrieves the latest copy of the cluster map, so it knows about all of the monitors, OSDs, and metadata servers in the cluster.
○ The client uses CRUSH and the cluster map to determine where an object lives.
○ Accesses OSDs directly.
CRUSH
● Controlled Replication Under Scalable Hashing
● A hash-based placement algorithm
● Generates a position based on:
○ the PG (Placement Group)
○ the cluster map
○ the rule set
What Problem Does CRUSH Solve?
How to Decide Where an Object Goes?
● Way 1: look up a table.
○ Easy to implement
○ Hard to scale horizontally

key1 → node1
key2 → node1
key3 → node2
key4 → node1
How to Decide Where an Object Goes?
● Way 2: hashing.
○ Easy to implement
○ But too much data moves when rebalancing (see the sketch below)

Before: A: 0~33, B: 33~66, C: 66~99
After adding a new node: A: 0~25, B: 25~50, D: 50~75, C: 75~100
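A minimal Python sketch (not from the slides) of why plain modulo hashing rebalances badly: it counts how many of 10,000 objects move when a fourth node joins a three-node cluster.

import hashlib

def place(key, num_nodes):
    # Hash the key and pick a node by modulo.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = ['obj-%d' % i for i in range(10000)]
before = {k: place(k, 3) for k in keys}  # three nodes: A, B, C
after = {k: place(k, 4) for k in keys}   # add a fourth node, D

moved = sum(1 for k in keys if before[k] != after[k])
print('%d of %d objects moved (~%.0f%%)'
      % (moved, len(keys), 100.0 * moved / len(keys)))
# Roughly three quarters of the objects move, even though the
# cluster only grew by one node.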
How to Decide Where an Object Goes?
● Way 3: hash into a static table.
○ Hash each object to a virtual partition, then look the partition up in a static table that maps partitions to nodes.
○ This is what OpenStack Swift does.

[Diagram: data1 to data5 hash into virtual partitions 1 to 4, and a static map assigns the partitions to node1, node2, and node3]
How to Decide Where an Object Goes?
● Way 4: CRUSH.
○ Fast calculation, no lookup
○ Repeatable, deterministic
○ Statistically uniform distribution
○ Stable mapping
○ Rule-based configuration
Why We Need Placement Groups
● A layer of indirection between the Ceph OSD Daemons and the Ceph clients.
○ Decouples OSDs from clients.
○ Makes rebalancing easy.
Placement Group
[Diagram: objects map into placement groups; each pool contains several placement groups, and each placement group maps onto several OSDs]
How to Compute Object Location
object id: foo, pool: bar
hash("foo") % 256 = 0x23
"bar" => pool id 3
Placement Group: 3.23
OSDMap.h
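Below is a simplified sketch of that mapping (illustrative only, not the real code from OSDMap.h; Ceph actually uses the rjenkins hash and a stable modulo over pg_num).

import hashlib

POOLS = {'bar': 3}   # assumed pool-name -> pool-id table
PG_NUM = 256         # number of placement groups in the pool

def object_to_pg(pool_name, object_name):
    # Hash the object name into a placement seed, then prefix it
    # with the pool id to form a PG id such as "3.23".
    pool_id = POOLS[pool_name]
    ps = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % PG_NUM
    return '%d.%x' % (pool_id, ps)

print(object_to_pg('bar', 'foo'))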
How to Compute Object Location
Placement Group 3.23 → CRUSH → [OSD, OSD, OSD]
How Does the Client Access an Object?
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed, based on the cluster information.
3. The client stores the data on the storage nodes directly.
[Diagram: Client ↔ Monitor; Client → Storage nodes]
How Does the Client Write an Object?
How Does the Client Read an Object?
● Read from the primary OSD, or
● Send reads to all replicas; the quickest reply "wins" and the others are ignored.
[Diagram: a client reading from the primary OSD vs. a client reading from all replicas]
Rados Gateway
● HTTP REST gateway for the RADOS object store
● Rich API
○ S3 API
○ Swift API
● Integrates with OpenStack Keystone
● Stateless
radosgw with an S3 client
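The original slide showed a screenshot; a minimal boto sketch along the same lines (the endpoint and credentials are placeholders):

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',          # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
    host='radosgw.example.com',              # placeholder radosgw endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket('my-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello radosgw')
print(key.get_contents_as_string())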
radosgw with a Swift client
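Likewise, a minimal python-swiftclient sketch against radosgw's Swift-compatible API (the user, key, and auth URL are placeholders):

import swiftclient

conn = swiftclient.Connection(
    user='tenant:user',                              # placeholder account
    key='SECRET_KEY',
    authurl='http://radosgw.example.com/auth/v1.0',  # placeholder endpoint
)
conn.put_container('my-container')
conn.put_object('my-container', 'hello.txt', contents='hello radosgw')
headers, body = conn.get_object('my-container', 'hello.txt')
print(body)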
Why Do We Need radosgw?
● We want a RESTful API
○ S3 API
○ Swift API
● We don't want outsiders to know the cluster status.
RBD
● RADOS Block Device
○ Provides virtual disks that are just like real disks.
● Image striping (done by librbd)
● Integrates with the Linux kernel and KVM
Why Does RBD Need to Stripe Images?
● Avoid huge files
● Parallelism
● Random access

byte range      0~2500   2500~5000   5000~7500   7500~10000
object 01XXXX   OSD1     OSD3        OSD4        OSD6
object 02XXXX   OSD8     OSD2        OSD3        OSD5
object 03XXXX   OSD1     OSD6        OSD2        OSD3

[Diagram: librbd spreads the stripes of one image across OSD1, OSD3, OSD4, and others]
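As a rough illustration of the table above, a toy Python sketch (the 2500-byte stripe size matches the table, but the object-naming scheme here is invented; real RBD defaults to 4 MiB objects):

STRIPE_SIZE = 2500  # bytes per stripe, matching the table above

def offset_to_object(image_name, offset):
    # Each stripe becomes its own RADOS object, so consecutive
    # byte ranges of the image can land on different OSDs.
    obj_no = offset // STRIPE_SIZE
    return '%s.%06x' % (image_name, obj_no)

print(offset_to_object('img', 6000))  # offset 6000 is in the third stripe -> "img.000002"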
Does librados Support Striping?
● No, librados does not support striping.
● But you can use libradosstriper.
○ Poorly documented.
OpenStack with RBD
● You can use RBD with Nova, Glance, or Cinder.
● Cinder uses RBD to provide volumes.
● Glance uses RBD to store images.
● Why does Glance use RBD instead of librados or radosgw?
○ Copy-on-write (see the sketch below)
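A hedged librbd sketch of that copy-on-write path (pool and image names are placeholders, and the base image is assumed to be a format-2 image with layering enabled): a Glance-style base image is snapshotted, protected, and cloned without copying any data.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('glance-pool')   # placeholder pool names
volumes = cluster.open_ioctx('cinder-pool')

# Snapshot and protect the base image, then clone it. The clone is
# copy-on-write: no data is copied until the clone is written to.
base = rbd.Image(images, 'base-image')
base.create_snap('snap')
base.protect_snap('snap')
base.close()

rbd.RBD().clone(images, 'base-image', 'snap', volumes, 'new-volume')

volumes.close()
images.close()
cluster.shutdown()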
librados (Python)
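The original slide showed a code screenshot; here is a minimal sketch with the python-rados bindings (the pool and object names are placeholders):

import rados

# Connect using the local cluster configuration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
print(cluster.list_pools())

# Open an I/O context on a pool and do basic object operations.
ioctx = cluster.open_ioctx('my_pool')
ioctx.write_full('obj_name', b'hello ceph')
print(ioctx.read('obj_name'))
ioctx.set_xattr('obj_name', 'attr_name', b'value')
print(ioctx.get_xattr('obj_name', 'attr_name'))

ioctx.close()
cluster.shutdown()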
librbd (Python)
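Again a minimal sketch, this time with the python-rbd bindings (the pool and image names are placeholders):

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image, then read and write it like a disk.
rbd.RBD().create(ioctx, 'volume', 1024 ** 3)
image = rbd.Image(ioctx, 'volume')
image.write(b'hello rbd', 0)   # write at offset 0
print(image.read(0, 9))        # read the 9 bytes back

image.close()
ioctx.close()
cluster.shutdown()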
rados command
● rados lspools
● rados mkpool my_pool
● rados create obj_name -p my_pool
● rados put obj_name file_path -p my_pool
● rados get obj_name file_path -p my_pool
rados command
● rados getxattr obj_name attr_name
● rados setxattr obj_name attr_name value
● rados rmxattr obj_name attr_name
rados command
● rados lssnap -p pool_name
● rados mksnap snap_name -p pool_name
● rados rmsnap snap_name -p pool_name
● rados rollback obj_name snap_name -p pool_name
rados command
● rados import backup_dir pool_name
● rados export pool_name backup_dir
rbd command
● rbd create --size 1000 volume -p pools
● rbd map volume -p pools
○ /dev/rbd*
● rbd unmap /dev/rbd0
● rbd import file_path image_name
● rbd export image_name file_path
rbd command
● rbd snap ls <image-name>
● rbd snap create <snap-name>
● rbd snap rollback <snap-name>
● rbd snap rm <snap-name>
● rbd snap purge <image-name>
● rbd snap protect <snap-name>
● rbd snap unprotect <snap-name>
Yahoo Ceph Cluster Architecture
● COS consists of many Ceph clusters.
○ This limits the number of OSDs per cluster.
● Many gateways sit behind a load balancer.
Summary
● The evolution of storage systems
● An introduction to Ceph
● How CRUSH compares with other placement approaches
We're Hiring
● Interested in:
○ openstack
○ ceph
Q & A
Thank you