Introduction to Ceph - 果凍 (Jelly)

Introduction to Ceph


Page 1: Introduction to Ceph

Introduction to Ceph - 果凍 (Jelly)

Page 2: Introduction to Ceph

About Me

● Works at inwinstack (formerly In Win Technology's cloud application R&D center)
● python, django, linux, openstack, docker
● kjellytw at gmail dot com
● http://www.blackwhite.tw/

Page 3: Introduction to Ceph

Outline

● What is a storage system
● The evolution of storage systems
● Introduction to Ceph
● Using Ceph from Python
● A few Ceph commands
● Yahoo's Ceph cluster architecture

Page 4: Introduction to Ceph

What Is a Storage System?

● Capacity is limited, but data keeps growing
● When one disk is not enough, add another one
● A storage system is a system for managing large amounts of data

Page 5: Introduction to Ceph

Storage System

● Features:
○ Replication
○ High capacity
○ Consistency

● Optional features:
○ Over-allocation (thin provisioning)
○ Snapshots
○ Deduplication

Page 6: Introduction to Ceph

Evolution of Storage Systems - 1: Using Your Own Local Disks

[Diagram: a host with its own disks behind a disk controller, with the file system and directory layer on top; the same stack with the host attached to a JBOD full of disks.]

Page 7: Introduction to Ceph

Evolution of Storage Systems - 2

[Diagram: the host keeps the file system and directory layer and accesses a LUN over FC protocol or iSCSI; the disks sit behind a disk controller or SCSI controller in an external array.]

Page 8: Introduction to Ceph

Evolution of Storage Systems - 3

[Diagram: the host talks to a NAS over the network; the file system lives on the NAS, whose disks sit behind a disk controller or SCSI controller (LUN over FC protocol or iSCSI), and the directory layer is exposed to the host.]

Page 9: Introduction to Ceph

What's Next?

● Storage cluster:
○ capacity
○ performance

● Storage cluster examples:
○ IBRIX
○ PanFS
○ Ceph

Page 10: Introduction to Ceph

Storage Cluster - 1

[Diagram: client, controller, and storage nodes]

1. The client sends data to the controller.
2. The controller stores the data on the storage nodes.

Page 11: Introduction to Ceph

Storage Cluster - 2

[Diagram: client, controller, and storage nodes]

1. The client asks the controller where the data is stored.
2. The client stores the data on the storage nodes directly.

Page 12: Introduction to Ceph

Storage Cluster - 3

[Diagram: client, monitor, and storage nodes]

1. The client gets the cluster information from the monitor.
2. The client computes the position where the data should be placed, based on the cluster information.
3. The client stores the data on the storage nodes directly.

Page 13: Introduction to Ceph

Ceph

● A clustered storage system
● Software-defined storage:
○ Cost/performance tradeoff
○ Flexible interfaces
○ Different storage abstractions

Page 14: Introduction to Ceph

Ceph

● A distributed object store and file system designed to provide excellent performance, reliability and scalability.

● Open source and freely-available, and it always will be.

● Object Storage (rados)
● Block Storage (rbd)
● File System (cephfs)

Page 15: Introduction to Ceph

● Object storage:
○ You get/put objects by key through the interface the object store provides.
○ Example: S3
● Block storage:
○ Provides a virtual disk that behaves just like a real disk.
● File system:
○ Just like a NAS

Page 16: Introduction to Ceph

Ceph Motivating Principles

● Everything must scale horizontally
● No single point of failure
● Commodity hardware
● Self-manage whenever possible

Page 17: Introduction to Ceph

Ceph Architectural Features

● Object locations are computed:
○ CRUSH algorithm
○ Ceph OSD Daemons use it to compute where replicas of objects should be stored (and for rebalancing).
○ Ceph clients use the CRUSH algorithm to efficiently compute object locations.
● No centralized interface.
● OSDs serve clients directly.

Page 18: Introduction to Ceph

Ceph Components

Page 19: Introduction to Ceph

RADOS

● A scalable, reliable storage service for petabyte-scale storage clusters
● Components:
○ ceph-mon
○ ceph-osd
○ librados
● Data hierarchy:
○ pool
○ object

Page 20: Introduction to Ceph

RADOS

● ceph-mon:
○ Maintains a master copy of the cluster map.
○ Monitors do not serve stored objects to clients.

Page 21: Introduction to Ceph

RADOS

● ceph-osd:
○ Stores objects on a local file system.
○ Provides access to objects over the network, directly to clients.
○ Uses the CRUSH algorithm to determine where an object lives.
○ One per disk (or RAID group).
○ Peering.
○ Checks its own state and the state of other OSDs, and reports back to the monitors.

Page 22: Introduction to Ceph

RADOS

● librados:
○ The client retrieves the latest copy of the cluster map, so it knows about all of the monitors, OSDs, and metadata servers in the cluster.
○ The client uses CRUSH and the cluster map to compute where an object lives.
○ Accesses OSDs directly.

Page 23: Introduction to Ceph

CRUSH

● Controlled Replication Under Scalable Hashing
● A hash-based placement algorithm
● Generates a position based on:
○ PG (Placement Group)
○ cluster map
○ rule set

Page 24: Introduction to Ceph

What Problem Does CRUSH Solve?

Page 25: Introduction to Ceph

How Do We Decide Where an Object Goes?

● Way 1: look up a table.
○ Easy to implement
○ Hard to scale horizontally

key1 -> node1
key2 -> node1
key3 -> node2
key4 -> node1

Page 26: Introduction to Ceph

How Do We Decide Where an Object Goes?

● Way 2: hashing
○ Easy to implement
○ But too much data movement when rebalancing (sketched below)

[Diagram: with three nodes the hash ranges are A: 0~33, B: 33~66, C: 66~99; after adding a new node D they become A: 0~25, B: 25~50, D: 50~75, C: 75~99, so most keys change owners.]
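A minimal Python sketch of that problem (not Ceph code; the key names and node counts are illustrative): with plain "hash % node_count" placement, going from 3 to 4 nodes reassigns most objects.

import hashlib

def node_for(key, node_count):
    # Hash the key and pick a node by simple modulo.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % node_count

keys = ["obj-%d" % i for i in range(10000)]
before = {k: node_for(k, 3) for k in keys}   # three nodes: A, B, C
after = {k: node_for(k, 4) for k in keys}    # a fourth node is added
moved = sum(1 for k in keys if before[k] != after[k])
print("%.0f%% of objects changed nodes" % (100.0 * moved / len(keys)))  # roughly 75%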

Page 27: Introduction to Ceph

Page 28: Introduction to Ceph

How Do We Decide Where an Object Goes?

● Way 3: hash with a static table (sketched below)
○ Hash first, then look the result up in a table
○ This is what OpenStack Swift does

[Diagram: data1..data5 are hashed onto virtual partitions 1-4, and a static map assigns each virtual partition to node1, node2, or node3.]
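A small Python sketch of the static-table idea (the partition count and node map are illustrative, not Swift's real ring code):

import hashlib

PARTITIONS = 4                                   # fixed number of virtual partitions
partition_to_node = {0: "node1", 1: "node1",     # the static lookup table
                     2: "node2", 3: "node3"}

def node_for(key):
    # Hash into a virtual partition, then look the partition up in the table.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return partition_to_node[digest % PARTITIONS]

print(node_for("data1"))
# Adding a node only means reassigning some entries of partition_to_node,
# so far fewer objects move than with plain "hash % node_count".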

Page 29: Introduction to Ceph

How Do We Decide Where an Object Goes?

● Way 4: CRUSH
○ Fast calculation, no lookup
○ Repeatable, deterministic
○ Statistically uniform distribution
○ Stable mapping
○ Rule-based configuration

Page 30: Introduction to Ceph

Why We Need Placement Groups

● A layer of indirection between the Ceph OSD Daemons and Ceph clients.
○ Decouples OSDs from clients.
○ Makes rebalancing easy.

Page 31: Introduction to Ceph

Placement Group

[Diagram: objects map into placement groups, placement groups belong to pools, and each placement group is stored on several OSDs.]

Page 32: Introduction to Ceph

How to Compute an Object's Location

object id: foo
pool: bar

hash("foo") % 256 = 0x23
"bar" => pool id 3

Placement Group: 3.23

(see OSDMap.h)
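A simplified Python sketch of this computation; real Ceph uses its own hash (rjenkins) and a stable-mod, so the exact value below is illustrative:

import zlib

def placement_group(object_name, pool_id, pg_num):
    pg = zlib.crc32(object_name.encode()) % pg_num   # hash the object name into a PG
    return "%d.%x" % (pool_id, pg)                   # PG id is "<pool id>.<pg in hex>"

print(placement_group("foo", pool_id=3, pg_num=256))  # a "3.23"-style PG id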

Page 33: Introduction to Ceph

How to Compute an Object's Location

[Diagram: CRUSH maps Placement Group 3.23 to a set of OSDs.]

Page 34: Introduction to Ceph

How Does the Client Access an Object?

[Diagram: client, monitor, and storage nodes]

1. The client gets the cluster information from the monitor.
2. The client computes the position where the data should be placed, based on the cluster information.
3. The client stores the data on the storage nodes directly.

Page 35: Introduction to Ceph

How Does the Client Write an Object?

Page 36: Introduction to Ceph

How Does the Client Read an Object?

● Read from the primary OSD, or
● Send reads to all replicas; the quickest reply "wins" and the others are ignored.

[Diagram: a client reading only from the primary OSD, versus a client sending the read to every replica OSD.]

Page 37: Introduction to Ceph

RADOS Gateway

● HTTP REST gateway for the RADOS object store
● Rich APIs:
○ S3 API
○ Swift API
● Integrates with OpenStack Keystone
● Stateless

Page 38: Introduction to Ceph

radosgw with an S3 client
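A minimal boto3 sketch of talking to radosgw through its S3 API; the endpoint URL, credentials, and bucket name are placeholders, not values from the talk.

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://radosgw.example.com:7480",  # placeholder radosgw endpoint
    aws_access_key_id="ACCESS_KEY",                  # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)
s3.create_bucket(Bucket="my-bucket")
s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello ceph")
print(s3.get_object(Bucket="my-bucket", Key="hello.txt")["Body"].read())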

Page 39: Introduction to Ceph

radosgw with a Swift client
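A minimal python-swiftclient sketch of the same idea through the Swift API; the auth URL, user, and key are placeholders for a radosgw Swift subuser.

from swiftclient.client import Connection

conn = Connection(
    authurl="http://radosgw.example.com:7480/auth/1.0",  # placeholder auth endpoint
    user="testuser:swift",                               # placeholder Swift subuser
    key="SWIFT_SECRET_KEY",
)
conn.put_container("my-container")
conn.put_object("my-container", "hello.txt", contents=b"hello ceph")
headers, body = conn.get_object("my-container", "hello.txt")
print(body)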

Page 40: Introduction to Ceph

Why Do We Need radosgw?

● We want a RESTful API:
○ S3 API
○ Swift API
● We don't want outside clients to know the cluster status.

Page 41: Introduction to Ceph

RBD

● RADOS Block Device
○ Provides a virtual disk that behaves just like a real disk.
● Image striping (done by librbd)
● Integrates with the Linux kernel and KVM

Page 42: Introduction to Ceph

Why Does RBD Need to Stripe Images?

● Avoid big files
● Parallelism
● Random access

offset range    0~2500   2500~5000   5000~7500   7500~10000
01XXXX          OSD1     OSD3        OSD4        OSD6
02XXXX          OSD8     OSD2        OSD3        OSD5
03XXXX          OSD1     OSD6        OSD2        OSD3

[Diagram: librbd spreads an image's stripes across OSD1, OSD3, and OSD4.]
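A tiny Python sketch of the striping arithmetic behind the table above; the 2500-byte stripe size just matches the example ranges (real RBD images default to 4 MB objects).

STRIPE_SIZE = 2500   # illustrative, to match the table; real RBD defaults to 4 MB

def locate(image_offset):
    object_index = image_offset // STRIPE_SIZE      # which striped object holds the byte
    offset_in_object = image_offset % STRIPE_SIZE   # where inside that object
    return object_index, offset_in_object

print(locate(6000))  # (2, 1000): the third stripe, which the table maps to its own OSD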

Page 43: Introduction to Ceph

Does librados Support Striping?

● No, librados does not support striping.
● But you can use libradosstriper.
○ Poorly documented.

Page 44: Introduction to Ceph

How OpenStack Uses RBD

● You can use RBD with Nova, Glance, or Cinder.
● Cinder uses RBD to provide volumes.
● Glance uses RBD to store images.
● Why does Glance use RBD instead of librados or radosgw?
○ Copy-on-write

Page 45: Introduction to Ceph

librados(python)

Page 46: Introduction to Ceph

librados(python)
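A minimal sketch of typical librados usage from Python, along the lines of what these slides demonstrate; the conffile path, pool, object, and xattr names are placeholders.

import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # placeholder config path
cluster.connect()
ioctx = cluster.open_ioctx("my_pool")                  # placeholder pool name
ioctx.write_full("obj_name", b"hello ceph")            # store an object
print(ioctx.read("obj_name"))                          # read it back
ioctx.set_xattr("obj_name", "attr_name", b"value")     # attach an extended attribute
ioctx.close()
cluster.shutdown()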

Page 47: Introduction to Ceph

librbd(python)

Page 48: Introduction to Ceph

librbd(python)
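A minimal sketch of typical librbd usage from Python; the pool name, image name, and 1 GiB size are placeholders.

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")  # placeholder config path
cluster.connect()
ioctx = cluster.open_ioctx("rbd")                      # placeholder pool name
rbd.RBD().create(ioctx, "volume", 1024 * 1024 * 1024)  # create a 1 GiB image
image = rbd.Image(ioctx, "volume")
image.write(b"hello ceph", 0)                          # write at offset 0
print(image.read(0, 10))                               # read 10 bytes back
image.close()
ioctx.close()
cluster.shutdown()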

Page 49: Introduction to Ceph

rados commands

● rados lspools
● rados mkpool my_pool
● rados create obj_name -p my_pool
● rados put obj_name file_path -p my_pool
● rados get obj_name file_path -p my_pool

Page 50: Introduction to Ceph

rados commands

● rados getxattr obj_name attr_name
● rados setxattr obj_name attr_name value
● rados rmxattr obj_name attr_name

Page 51: Introduction to Ceph

rados commands

● rados lssnap -p pool_name
● rados mksnap snap_name -p pool_name
● rados rmsnap snap_name -p pool_name
● rados rollback obj_name snap_name -p pool_name

Page 52: Introduction to Ceph

rados commands

● rados import backup_dir pool_name
● rados export pool_name backup_dir

Page 53: Introduction to Ceph

rbd commands

● rbd create --size 1000 volume -p pools
● rbd map volume -p pools
○ creates /dev/rbd*
● rbd unmap /dev/rbd0
● rbd import file_path image_name
● rbd export image_name file_path

Page 54: Introduction to Ceph

rbd commands

● snap ls <image-name>
● snap create <snap-name>
● snap rollback <snap-name>
● snap rm <snap-name>
● snap purge <image-name>
● snap protect <snap-name>
● snap unprotect <snap-name>

Page 55: Introduction to Ceph

Yahoo Ceph Cluster Architecture

● COS consists of many Ceph clusters.
○ Limits the number of OSDs per cluster.
● Many gateways sit behind a load balancer.

Page 56: Introduction to Ceph

Summary

● The evolution of storage systems
● An introduction to Ceph
● CRUSH compared with the other placement methods

Page 57: Introduction to Ceph

We're Hiring

● Interested in:
○ openstack
○ ceph

Page 58: Introduction to Ceph

Q & A

Page 59: Introduction to Ceph

Thank you