Ceph Introduction
果凍 (Jelly)
About Me
● Works at inwinstack (formerly In Win's cloud application R&D center)
● python, django, linux, openstack, docker
● kjellytw at gmail dot com
● http://www.blackwhite.tw/
Outline
● What is a storage system
● The evolution of storage systems
● An introduction to Ceph
● Using Ceph from Python
● A few Ceph commands
● Yahoo's Ceph cluster architecture
What Is a Storage System?
● Capacity is finite; data is endless
● When one disk is not enough, add another
● A storage system is a system that manages large amounts of data
Storage System
● Features:
○ Replication
○ High capacity
○ Consistency
● Optional features:
○ Over-allocation
○ Snapshots
○ Deduplication
Evolution of Storage Systems - 1: Each Host Uses Its Own Disks
[Diagram: a host with directly attached disks; disk controller → file system → directory layer]
[Diagram: a JBOD enclosure attached to a host; disk controller → file system → directory layer]
Evolution of Storage Systems - 2
[Diagram: a storage array (disk controller or SCSI controller) exposes LUNs to hosts over the FC protocol or iSCSI; the file system and directory layer stay on each host]
Evolution of Storage Systems - 3
[Diagram: NAS; the disks, controller, LUNs, and file system live inside the NAS, and hosts see the directory layer over the network]
What’s Next?
● Storage cluster:
○ capacity
○ performance
● Storage cluster examples:
○ IBRIX
○ PanFS
○ Ceph
Storage Cluster - 1
1. The client sends data to the controller.
2. The controller stores the data on the storage nodes.
[Diagram: Client → Controller → Storage nodes]
Storage Cluster - 2
1. The client asks the controller where the data is stored.
2. The client stores the data on the storage nodes directly.
[Diagram: Client ↔ Controller; Client → Storage nodes]
Storage Cluster - 3
Clientmonitor
Storage
1. Get the cluster information.
3. Cient store the data to stroa“ge directly.
2. Client compute the position where the data should put based on cluster information
Storage node
Ceph
● A clustered storage system
● Software-defined storage
○ Cost/performance tradeoff
○ Flexible interfaces
○ Different storage abstractions
Ceph
● A distributed object store and file system designed to provide excellent performance, reliability and scalability.
● Open source and freely-available, and it always will be.
● Object Storage (rados)
● Block Storage (rbd)
● File System (cephfs)
● Object Storage:
○ You get/put objects by key through the interface the object store provides.
○ Example: S3
● Block Storage:
○ Provides virtual disks that are just like real disks.
● File System:
○ Just like a NAS.
Ceph Motivating Principles
● Everything must scale horizontally
● No single point of failure
● Commodity hardware
● Self-manage whenever possible
Ceph Architectural Features
● Object locations are computed, not looked up.
○ The CRUSH algorithm
○ Ceph OSD Daemons use it to compute where replicas of objects should be stored (and for rebalancing).
○ Ceph clients use CRUSH to efficiently compute object locations themselves.
● No centralized interface.
● OSDs serve clients directly.
Ceph Components
RADOS
● A scalable, reliable storage service for petabyte-scale storage clusters
● Components:
○ ceph-mon
○ ceph-osd
○ librados
● Data hierarchy:
○ pool
○ object
RADOS
● ceph-mon:
○ Maintains the master copy of the cluster map.
○ Monitors do not serve stored objects to clients.
RADOS
● ceph-osd:
○ Stores objects on a local file system.
○ Serves objects over the network to clients directly.
○ Uses the CRUSH algorithm to determine where an object lives.
○ One per disk (or RAID group).
○ Peers with other OSDs.
○ Checks its own state and the state of other OSDs, and reports back to the monitors.
RADOS
● librados:
○ The client retrieves the latest copy of the cluster map, so it knows about all of the monitors, OSDs, and metadata servers in the cluster.
○ The client uses CRUSH and the cluster map to determine where an object lives.
○ Accesses OSDs directly.
CRUSH
● Controlled Replication Under Scalable Hashing
● A hash-based placement algorithm
● Generates a position based on:
○ the PG (Placement Group)
○ the cluster map
○ the rule set
What Problem Does CRUSH Solve?
How to Decide Where an Object Goes?
● Way 1: look up a table.
○ Easy to implement
○ Hard to scale horizontally

key1 → node1
key2 → node1
key3 → node2
key4 → node1
How to Decide Where an Object Goes?
● Way 2: hashing.
○ Easy to implement
○ But too much data moves when rebalancing (see the sketch below)

Before: A: 0~33, B: 33~66, C: 66~99
After adding a new node: A: 0~25, B: 25~50, D: 50~75, C: 75~100
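A minimal Python sketch (not from the slides) of why plain modulo hashing rebalances badly: it counts how many of 10,000 objects move when a fourth node joins a three-node cluster.

import hashlib

def place(key, num_nodes):
    # Hash the key and pick a node by modulo.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes

keys = ['obj-%d' % i for i in range(10000)]
before = {k: place(k, 3) for k in keys}  # three nodes: A, B, C
after = {k: place(k, 4) for k in keys}   # add a fourth node, D

moved = sum(1 for k in keys if before[k] != after[k])
print('%d of %d objects moved (~%.0f%%)'
      % (moved, len(keys), 100.0 * moved / len(keys)))
# Roughly three quarters of the objects move, even though the
# cluster only grew by one node.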
How to Decide Where an Object Goes?
● Way 3: hash into a static table.
○ Hash each object to a virtual partition, then look the partition up in a static table that maps partitions to nodes.
○ This is what OpenStack Swift does.

[Diagram: data1 to data5 hash into virtual partitions 1 to 4, and a static map assigns the partitions to node1, node2, and node3]
How to Decide Where an Object Goes?
● Way 4: CRUSH.
○ Fast calculation, no lookup
○ Repeatable, deterministic
○ Statistically uniform distribution
○ Stable mapping
○ Rule-based configuration
Why We Need Placement Groups
● A layer of indirection between the Ceph OSD Daemons and the Ceph clients.
○ Decouples OSDs from clients.
○ Makes rebalancing easy.
Placement Group
[Diagram: objects map into placement groups; each pool contains several placement groups, and each placement group maps onto several OSDs]
How to Compute Object Location
object id: foo, pool: bar
hash("foo") % 256 = 0x23
"bar" => pool id 3
Placement Group: 3.23
OSDMap.h
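Below is a simplified sketch of that mapping (illustrative only, not the real code from OSDMap.h; Ceph actually uses the rjenkins hash and a stable modulo over pg_num).

import hashlib

POOLS = {'bar': 3}   # assumed pool-name -> pool-id table
PG_NUM = 256         # number of placement groups in the pool

def object_to_pg(pool_name, object_name):
    # Hash the object name into a placement seed, then prefix it
    # with the pool id to form a PG id such as "3.23".
    pool_id = POOLS[pool_name]
    ps = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % PG_NUM
    return '%d.%x' % (pool_id, ps)

print(object_to_pg('bar', 'foo'))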
How to Compute Object Location
Placement Group 3.23 → CRUSH → [OSD, OSD, OSD]
How Does the Client Access an Object?
1. The client gets the cluster information from the monitor.
2. The client computes where the data should be placed, based on the cluster information.
3. The client stores the data on the storage nodes directly.
[Diagram: Client ↔ Monitor; Client → Storage nodes]
How Does the Client Write an Object?
How Does the Client Read an Object?
● Read from the primary OSD, or
● Send reads to all replicas; the quickest reply "wins" and the others are ignored.
[Diagram: a client reading from the primary OSD vs. a client reading from all replicas]
Rados Gateway
● HTTP REST gateway for the RADOS object store
● Rich API
○ S3 API
○ Swift API
● Integrates with OpenStack Keystone
● Stateless
radosgw with an S3 client
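The original slide showed a screenshot; a minimal boto sketch along the same lines (the endpoint and credentials are placeholders):

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY',          # placeholder credentials
    aws_secret_access_key='SECRET_KEY',
    host='radosgw.example.com',              # placeholder radosgw endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket('my-bucket')
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello radosgw')
print(key.get_contents_as_string())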
radosgw with a Swift client
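Likewise, a minimal python-swiftclient sketch against radosgw's Swift-compatible API (the user, key, and auth URL are placeholders):

import swiftclient

conn = swiftclient.Connection(
    user='tenant:user',                              # placeholder account
    key='SECRET_KEY',
    authurl='http://radosgw.example.com/auth/v1.0',  # placeholder endpoint
)
conn.put_container('my-container')
conn.put_object('my-container', 'hello.txt', contents='hello radosgw')
headers, body = conn.get_object('my-container', 'hello.txt')
print(body)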
Why Do We Need radosgw?
● We want a RESTful API
○ S3 API
○ Swift API
● We don't want outsiders to know the cluster status.
RBD
● RADOS Block Device
○ Provides virtual disks that are just like real disks.
● Image striping (done by librbd)
● Integrates with the Linux kernel and KVM
Why Does RBD Need to Stripe Images?
● Avoid huge files
● Parallelism
● Random access

byte range      0~2500   2500~5000   5000~7500   7500~10000
object 01XXXX   OSD1     OSD3        OSD4        OSD6
object 02XXXX   OSD8     OSD2        OSD3        OSD5
object 03XXXX   OSD1     OSD6        OSD2        OSD3

[Diagram: librbd spreads the stripes of one image across OSD1, OSD3, OSD4, and others]
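As a rough illustration of the table above, a toy Python sketch (the 2500-byte stripe size matches the table, but the object-naming scheme here is invented; real RBD defaults to 4 MiB objects):

STRIPE_SIZE = 2500  # bytes per stripe, matching the table above

def offset_to_object(image_name, offset):
    # Each stripe becomes its own RADOS object, so consecutive
    # byte ranges of the image can land on different OSDs.
    obj_no = offset // STRIPE_SIZE
    return '%s.%06x' % (image_name, obj_no)

print(offset_to_object('img', 6000))  # offset 6000 is in the third stripe -> "img.000002"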
Does librados Support Striping?
● No, librados does not support striping.
● But you can use libradosstriper.
○ Poorly documented.
OpenStack with RBD
● You can use RBD with Nova, Glance, or Cinder.
● Cinder uses RBD to provide volumes.
● Glance uses RBD to store images.
● Why does Glance use RBD instead of librados or radosgw?
○ Copy-on-write (see the sketch below)
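A hedged librbd sketch of that copy-on-write path (pool and image names are placeholders, and the base image is assumed to be a format-2 image with layering enabled): a Glance-style base image is snapshotted, protected, and cloned without copying any data.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
images = cluster.open_ioctx('glance-pool')   # placeholder pool names
volumes = cluster.open_ioctx('cinder-pool')

# Snapshot and protect the base image, then clone it. The clone is
# copy-on-write: no data is copied until the clone is written to.
base = rbd.Image(images, 'base-image')
base.create_snap('snap')
base.protect_snap('snap')
base.close()

rbd.RBD().clone(images, 'base-image', 'snap', volumes, 'new-volume')

volumes.close()
images.close()
cluster.shutdown()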
librados (Python)
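The original slide showed a code screenshot; here is a minimal sketch with the python-rados bindings (the pool and object names are placeholders):

import rados

# Connect using the local cluster configuration.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
print(cluster.list_pools())

# Open an I/O context on a pool and do basic object operations.
ioctx = cluster.open_ioctx('my_pool')
ioctx.write_full('obj_name', b'hello ceph')
print(ioctx.read('obj_name'))
ioctx.set_xattr('obj_name', 'attr_name', b'value')
print(ioctx.get_xattr('obj_name', 'attr_name'))

ioctx.close()
cluster.shutdown()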
librbd (Python)
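Again a minimal sketch, this time with the python-rbd bindings (the pool and image names are placeholders):

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image, then read and write it like a disk.
rbd.RBD().create(ioctx, 'volume', 1024 ** 3)
image = rbd.Image(ioctx, 'volume')
image.write(b'hello rbd', 0)   # write at offset 0
print(image.read(0, 9))        # read the 9 bytes back

image.close()
ioctx.close()
cluster.shutdown()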
rados command
● rados lspools
● rados mkpool my_pool
● rados create obj_name -p my_pool
● rados put obj_name file_path -p my_pool
● rados get obj_name file_path -p my_pool
rados command
● rados getxattr obj_name attr_name
● rados setxattr obj_name attr_name value
● rados rmxattr obj_name attr_name
rados command
● rados lssnap -p pool_name
● rados mksnap snap_name -p pool_name
● rados rmsnap snap_name -p pool_name
● rados rollback obj_name snap_name -p pool_name
rados command
● rados import backup_dir pool_name
● rados export pool_name backup_dir
rbd command
● rbd create --size 1000 volume -p pools
● rbd map volume -p pools
○ /dev/rbd*
● rbd unmap /dev/rbd0
● rbd import file_path image_name
● rbd export image_name file_path
rbd command
● rbd snap ls <image-name>
● rbd snap create <snap-name>
● rbd snap rollback <snap-name>
● rbd snap rm <snap-name>
● rbd snap purge <image-name>
● rbd snap protect <snap-name>
● rbd snap unprotect <snap-name>
Yahoo Ceph Cluster Architecture
● COS consists of many Ceph clusters.
○ This limits the number of OSDs per cluster.
● Many gateways sit behind a load balancer.
Summary
● The evolution of storage systems
● An introduction to Ceph
● How CRUSH compares with other placement approaches
We're Hiring
● Interested in:
○ openstack
○ ceph
Q & A
Thank you