
Page 1: Ceph barcelona-v-1.2


Best practices & Performance Tuning OpenStack Cloud Storage with Ceph

OpenStack Summit Barcelona, 25th Oct 2016 @ 17:05 - 17:45

Room: 118-119

Page 2: Ceph barcelona-v-1.2

Swami Reddy, RJIL

OpenStack & Ceph Dev

Pandiyan M, RJIL

OpenStack Dev

Who are we?

Page 3: Ceph barcelona-v-1.2

Agenda

• Ceph - Quick Overview

• OpenStack - Ceph Integration

• OpenStack - Recommendations

• Ceph - Recommendations

• Q & A

• References

Page 4: Ceph barcelona-v-1.2

Cloud Environment Details

Cloud env with 200 nodes for general-purpose use cases: ~2500 VMs, 40 TB RAM and 5120 cores, on 4 PB of storage.

• Average boot volume sizes:
  o Linux VMs: 20 GB
  o Windows VMs: 100 GB
• Average data volume size: 200 GB

Compute (~160 nodes)
• CPU: 2 x 16 cores @ 2.60 GHz
• RAM: 256 GB
• HDD: 3.6 TB (OS drive)
• NICs: 2 x 10 Gbps, 2 x 1 Gbps
• Overprovisioning: CPU 1:8, RAM 1:1

Storage (~44 nodes)
• CPU: 2 x 12 cores @ 2.50 GHz
• RAM: 128 GB
• HDD: 2 x 1 TB (OS drive)
• OSD: 22 x 3.6 TB
• SSD: 2 x 800 GB (Intel S3700)
• NICs: 2 x 10 Gbps, 2 x 1 Gbps
• Replication: 3


Page 6: Ceph barcelona-v-1.2

Ceph - Quick Overview

Page 7: Ceph barcelona-v-1.2

Ceph Overview

Design goals:
• Every component must scale
• No single point of failure
• Open source
• Runs on commodity hardware
• Everything must self-manage

Key benefits:
• Multi-node striping and redundancy
• COW cloning of images to volumes
• Live migration of Ceph-backed VMs

Page 8: Ceph barcelona-v-1.2

OpenStack - Ceph Integration

Page 9: Ceph barcelona-v-1.2

OpenStack - Ceph Integration

(Diagram: OpenStack services Cinder, Glance and Nova talk to the hypervisor (QEMU/KVM) via RBD, and the Swift API is served by RGW; both paths sit on top of the Ceph storage cluster, RADOS.)

Page 10: Ceph barcelona-v-1.2

OpenStack - Ceph Integration

OpenStack block storage - RBD flow:

• libvirt -> QEMU -> librbd -> librados -> OSDs and MONs

OpenStack object storage - RGW flow:

• S3/Swift APIs -> RGW (radosgw) -> librados -> OSDs and MONs

(Diagram: OpenStack configures libvirt, which drives QEMU and librbd on top of librados/RADOS; the S3- and Swift-compatible APIs are served by radosgw, also on top of librados/RADOS.)

Page 11: Ceph barcelona-v-1.2

OpenStack - Recommendations

Page 12: Ceph barcelona-v-1.2

Glance Recommendations

• What is Glance? The OpenStack image service.

• Configuration settings: /etc/glance/glance-api.conf
• Use Ceph RBD as the Glance store:

  default_store = rbd

• When booting from volumes, disable the local image cache (change the paste_deploy flavor):

  flavor = keystone+cachemanagement  ->  flavor = keystone

• Exposing the image URL saves time, since the image does not need to be downloaded and copied:

  show_image_direct_url = True
  show_multiple_locations = True

# glance --os-image-api-version 2 image-show 64b71b88-f243-4470-8918-d3531f461a26
+------------------+-----------------------------------------------------------------+
| Property         | Value                                                           |
+------------------+-----------------------------------------------------------------+
| checksum         | 24bc1b62a77389c083ac7812a08333f2                                |
| container_format | bare                                                            |
| created_at       | 2016-04-19T05:56:46Z                                            |
| description      | Image Updated on 18th April 2016                                |
| direct_url       | rbd://8a0021e6-3788-4cb3-8ada-                                  |
|                  | 1f6a7b0d8d15/images/64b71b88-f243-4470-8918-d3531f461a26/snap   |
| disk_format      | raw                                                             |
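For reference, a minimal [glance_store] section for an RBD-backed Glance might look like the sketch below; the pool and user names are examples, not values taken from this deployment:

[glance_store]
stores = rbd
default_store = rbd
rbd_store_pool = images              # example pool name
rbd_store_user = glance              # example cephx client user
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_chunk_size = 8             # image chunk size in MB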

Page 13: Ceph barcelona-v-1.2

Glance Recommendations

Image format: use ONLY RAW images.

With QCOW2 images:
• Convert the QCOW2 image to RAW (see the example after the table)
• Get the image UUID

With RAW images (no conversion, which saves time):
• Get the image UUID

Image Size (GB)   Format   VM Boot Time (approx.)
50 (Windows)      QCOW2    ~45 minutes
                  RAW      ~1 minute
6 (Linux)         QCOW2    ~2 minutes
                  RAW      ~1 minute
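A hedged example of the QCOW2-to-RAW conversion step, assuming a local image file (file and image names are placeholders):

# convert the QCOW2 file to RAW
qemu-img convert -f qcow2 -O raw cirros.qcow2 cirros.raw
# upload the RAW file to Glance and note the returned image UUID
glance image-create --name "cirros-raw" --disk-format raw --container-format bare --file cirros.raw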

Page 14: Ceph barcelona-v-1.2

Cinder Recommendations

• What is Cinder? The OpenStack block storage service.

• Configuration settings: /etc/cinder/cinder.conf
• Enable Ceph as a backend (a fuller backend section is sketched below):

  enabled_backends = ceph

• Cinder backup: the Ceph backup driver supports incremental backups:

  backup_driver = cinder.backup.drivers.ceph
  backup_ceph_conf = /etc/ceph/ceph.conf
  backup_ceph_user = cinder
  backup_ceph_chunk_size = 134217728
  backup_ceph_pool = backups
  backup_ceph_stripe_unit = 0
  backup_ceph_stripe_count = 0
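A minimal sketch of the Ceph backend section in /etc/cinder/cinder.conf; the pool and user names and the secret UUID are examples and must match your own Ceph and libvirt setup:

[DEFAULT]
enabled_backends = ceph

[ceph]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph
rbd_pool = volumes                      # example pool name
rbd_user = cinder                       # example cephx user
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_secret_uuid = <libvirt-secret-uuid> # UUID of the libvirt secret holding the cephx key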

Page 15: Ceph barcelona-v-1.2

Nova Recommendations

• What is Nova? The OpenStack compute service.

• Configuration settings: /etc/nova/nova.conf

• Use the userspace librbd/librados path instead of the kernel RBD client (krbd); see the sketch after the block below.

[libvirt]
# enable discard support (be careful of the performance impact)
hw_disk_discard = unmap
# disable password injection
inject_password = false
# disable key injection
inject_key = false
# disable partition injection
inject_partition = -2
# make QEMU aware of the cache mode so RBD caching works
disk_cachemodes = "network=writeback"
live_migration_flag = "VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST"
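For completeness, the RBD-specific settings that usually accompany the block above (a sketch; the pool and user names and the UUID are examples, reusing the Ceph pools/users created for Cinder):

[libvirt]
images_type = rbd                       # put ephemeral disks on RBD
images_rbd_pool = vms                   # example pool name
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder                       # example cephx user
rbd_secret_uuid = <libvirt-secret-uuid> # same libvirt secret as used by Cinder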

Page 16: Ceph barcelona-v-1.2

Ceph - Recommendations

Page 17: Ceph barcelona-v-1.2

Performance Decision Factors

• How much storage is required (usable vs. raw)?

• How many IOPS?
  • Aggregate
  • Per VM (min/max)

• What are you optimizing for?
  • Performance
  • Cost

Page 18: Ceph barcelona-v-1.2

Ceph Cluster Optimization Criteria

Cluster Optimization Criterion | Properties | Sample Use Cases

IOPS-optimized
• Lowest cost per IOPS
• Highest IOPS
• Meets the minimum fault-domain recommendation
Sample use cases: typically block storage; 3x replication

Throughput-optimized
• Lowest cost per unit of throughput
• Highest throughput
• Highest throughput per BTU
• Highest throughput per watt
• Meets the minimum fault-domain recommendation
Sample use cases: block or object storage; 3x replication for higher read throughput

Capacity-optimized
• Lowest cost per TB
• Lowest BTU per TB
• Lowest watt per TB
• Meets the minimum fault-domain recommendation
Sample use cases: typically object storage; erasure coding common for maximizing usable capacity

Page 19: Ceph barcelona-v-1.2

OSD Considerations

• RAMo 1 GB of RAM per 1TB OSD space

• CPUo 0.5 CPU cores/1Ghz of a core per OSD (2 cores for SSD drives)

• Ceph-mons o 1 ceph-mon node per 15-20 OSD nodes

• Networko The sum of the total throughput of your OSD hard disks doesn’t exceed the network

bandwidth

• Thread counto High numbers of OSDs: (e.g., > 20) may spawn a lot of threads, during recovery and

rebalancing

HOSTOSD.2OSD.4OSD.6

OSD.1

OSD.3OSD.5
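Applying these rules of thumb to the storage nodes described earlier (22 x 3.6 TB OSDs per host) gives a rough sanity check:

RAM: 22 OSDs x 3.6 TB ≈ 80 TB raw  =>  ~80 GB of RAM (the 128 GB installed leaves headroom for recovery)
CPU: 22 OSDs x 0.5 core            =>  ~11 cores     (2 x 12 cores is comfortably sufficient)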

Page 20: Ceph barcelona-v-1.2

Ceph OSD Journal

• Run operating systems, OSD data, and OSD journals on separate drives to maximize overall throughput.

• On-disk (co-located) journals can halve write throughput.

• Use SSD journals for high write-throughput workloads.

• Performance comparison with and without an SSD journal, using rados bench (the commands are sketched below):
  o 100% write workload with 4 MB objects (the default), measured at a 1:11 SSD:OSD ratio:

  Op Type             No SSD   SSD
  Write (MB/s)        45       80
  Seq Read (MB/s)     73       140
  Rand Read (MB/s)    55       655

• Recommendation: use 1 SSD per 4-6 OSDs for better results.
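The numbers above come from rados bench; a hedged sketch of the kind of commands involved (the pool name and runtimes are examples, 4 MB is the default object size):

# 60-second write test, keeping the objects for the read tests
rados bench -p testpool 60 write --no-cleanup
# sequential and random read tests against the objects written above
rados bench -p testpool 60 seq
rados bench -p testpool 60 rand
# remove the benchmark objects afterwards
rados -p testpool cleanup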

Page 21: Ceph barcelona-v-1.2

OS Considerations

• Kernel: run the latest stable release.
• BIOS: enable HT (Hyper-Threading) and VT (Virtualization Technology).
• Kernel PID max:
  # echo "4194303" > /proc/sys/kernel/pid_max
• Read-ahead: set on all block devices:
  # echo "8192" > /sys/block/sda/queue/read_ahead_kb
• Swappiness:
  # echo "vm.swappiness = 0" | tee -a /etc/sysctl.conf
• Disable NUMA balancing: pass numa_balancing=disable on the kernel command line, or control it via the kernel.numa_balancing sysctl:
  # echo 0 > /proc/sys/kernel/numa_balancing
• CPU tuning: set the "performance" governor so CPUs always run at full frequency:
  # echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
• I/O scheduler (a persistent udev-based variant is sketched below):
  SATA/SAS drives: # echo "deadline" > /sys/block/sd[x]/queue/scheduler
  SSD drives:      # echo "noop"     > /sys/block/sd[x]/queue/scheduler

Page 22: Ceph barcelona-v-1.2

Ceph Deployment Network

Page 23: Ceph barcelona-v-1.2

Ceph Deployment Network

• Each host should have at least two 1 Gbps network interface controllers (NICs).

• Use 10G Ethernet.
• Always use jumbo frames:
  # ifconfig ethX mtu 9000
  # echo "MTU=9000" | tee -a /etc/sysconfig/network-scripts/ifcfg-ethX

• Use high-bandwidth links between ToR switches and spine routers, e.g. 40 Gbps to 100 Gbps.
• Hardware should have a Baseboard Management Controller (BMC).

• Note: running three networks in HA mode may seem like overkill, but each traffic path is a potential capacity and performance bottleneck worth planning for.

(Diagram: the public network and the cluster network each on a dedicated NIC, NIC-1 and NIC-2.)

Page 24: Ceph barcelona-v-1.2

Ceph Deployment Network

• NIC bonding, balance-alb mode: both NICs are used to send and receive traffic (a configuration sketch follows below).

• Test results with 2 x 10G NICs:

• Active-passive bond mode, traffic between 2 nodes:
  Case #1: node-1 to node-2 => BW 4.80 Gb/s
  Case #2: node-1 to node-2 => BW 4.62 Gb/s
  => roughly the speed of one 10 Gig NIC.

• Balance-alb bond mode:
  Case #1: node-1 to node-2 => BW 8.18 Gb/s
  Case #2: node-1 to node-2 => BW 8.37 Gb/s
  => roughly the speed of two 10 Gig NICs.
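A balance-alb bond could be defined as in the sketch below (Debian/Ubuntu ifupdown syntax with the ifenslave package; interface names, the address and the MTU are examples, and RHEL-style ifcfg files would differ):

auto bond0
iface bond0 inet static
    address 192.168.100.11      # example cluster-network address
    netmask 255.255.255.0
    bond-slaves eth2 eth3       # the two 10G NICs
    bond-mode balance-alb
    bond-miimon 100             # link monitoring interval in ms
    mtu 9000                    # jumbo frames, as recommended earlier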

Page 25: Ceph barcelona-v-1.2

Ceph Failure Domains

• A failure domain is any failure that prevents access to one or more OSDs. Weigh the added cost of isolating every potential failure domain against your availability requirements.

Failure domain levels (CRUSH bucket types): osd, host, chassis, rack, row, pdu, pod, room, datacenter, region
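As an illustration (not from the slides), a replicated CRUSH rule with rack as the failure domain can be created and assigned roughly like this; the rule and pool names and the rule number are examples, and newer releases use the "crush_rule" pool setting instead of "crush_ruleset":

# create a replicated rule that separates replicas across racks
ceph osd crush rule create-simple replicated-rack default rack
# point a pool at the new rule
ceph osd pool set volumes crush_ruleset 1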

Page 26: Ceph barcelona-v-1.2

Ceph Ops Recommendations

Scrub and deep-scrub operations are very I/O intensive and can affect cluster performance.

o Disable scrub and deep scrub:
  # ceph osd set noscrub
  set noscrub
  # ceph osd set nodeep-scrub
  set nodeep-scrub

o After setting noscrub/nodeep-scrub, cluster health goes into WARN state:
  # ceph health
  HEALTH_WARN noscrub, nodeep-scrub flag(s) set

o Re-enable scrub and deep scrub:
  # ceph osd unset noscrub
  unset noscrub
  # ceph osd unset nodeep-scrub
  unset nodeep-scrub

o Configure scrub and deep scrub:
  osd_scrub_begin_hour = 0         # begin at this hour
  osd_scrub_end_hour = 24          # start the last scrub at this hour
  osd_scrub_load_threshold = 0.05  # scrub only below this load
  osd_scrub_min_interval = 86400   # not more often than once a day
  osd_scrub_max_interval = 604800  # not less often than once a week
  osd_deep_scrub_interval = 604800 # deep scrub once a week

Page 27: Ceph barcelona-v-1.2

Ceph Ops Recommendations

• Decreasing the performance impact of recovery and backfilling.

• Settings for recovery and backfilling (they can also be injected at runtime; see the example below):

  osd max backfills        - maximum concurrent backfills to/from an OSD [default 10]
  osd recovery max active  - active recovery requests per OSD at one time [default 15]
  osd recovery threads     - number of threads used for recovering data [default 1]
  osd recovery op priority - priority of recovery operations [default 10]

Note: decreasing these values slows down and prolongs the recovery/backfill process but reduces the impact on clients; increasing them speeds up recovery/backfill at the cost of client performance.
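These can be lowered at runtime on a busy cluster without restarting OSDs, for example (the values shown are the conservative end; adjust to taste):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'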

Page 28: Ceph barcelona-v-1.2

Ceph Performance Measurement Guidelines

For the best measurement results, follow these rules while testing:

• Change one option at a time.
• Check what actually changed.
• Choose the right performance test for the changed option.
• Re-test each change at least ten times.
• Run tests for hours, not seconds.
• Watch for any errors.
• Look at the results critically.
• Estimate the expected results and look at the standard deviation to eliminate spikes and false tests.
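One way to run such repeatable tests directly against RBD is fio's rbd ioengine; a hedged job-file sketch (the pool, image and client names are examples, and the test image must be created first with rbd create):

# prerequisite (example names): rbd create -p testpool --size 10240 fio-test
[global]
ioengine=rbd
clientname=admin        # cephx client, i.e. client.admin
pool=testpool
rbdname=fio-test
runtime=300
time_based

[randwrite-4k]
rw=randwrite
bs=4k
iodepth=32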

Tuning:
• Ceph clusters can be re-parametrized after deployment to better fit the requirements of the workload.
• Some configuration options affect data redundancy and have significant implications for the stability and safety of data.
• Perform tuning in a test environment before issuing any command or configuration change in production.

Page 29: Ceph barcelona-v-1.2

Any questions?

Page 30: Ceph barcelona-v-1.2

Thank You

Swami Reddy | [email protected] | swamireddy @ irc

Satish | [email protected] | satish @ irc

Pandiyan M | [email protected] | maestropandy @ irc

Page 31: Ceph barcelona-v-1.2

Reference Links

• Ceph documentation
• Previous OpenStack summit presentations
• Ceph Tech Talks
• A few blogs on Ceph:
  • https://www.sebastien-han.fr/blog/categories/ceph/
  • https://www.redhat.com/en/files/resources/en-rhst-cephstorage-supermicro-INC0270868_v2_0715.pdf

Page 32: Ceph barcelona-v-1.2

Appendix

Page 33: Ceph barcelona-v-1.2

Ceph H/W Best Practices

Minimum hardware recommendations per daemon (each daemon type on its own host):

Daemon   CPU                                                             RAM
OSD      1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core     ~1 GB per 1 TB of OSD storage
MDS      1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core     1 GB per daemon
MON      1 x 64-bit core / 1 x 32-bit dual-core / 1 x i386 dual-core     1 GB per daemon

Page 34: Ceph barcelona-v-1.2

HDD, SDD, Controllers

• Ceph best practice is to run operating systems, OSD data, and OSD journals on separate drives.

Hard Disk Drives (HDD):
• Minimum hard disk drive size of 1 TB.
• ~1 GB of RAM per 1 TB of storage space.
NOTE: it is NOT a good idea to run:
1. multiple OSDs on a single disk
2. an OSD and a monitor or metadata server on the same disk

Solid State Drives (SSD):
• Use SSDs (e.g., for journals) to improve performance.

Controllers:
• Disk controllers also have a significant impact on write throughput.

Page 35: Ceph barcelona-v-1.2

Ceph OSD Journal - Results

Write operations

Page 36: Ceph barcelona-v-1.2

Ceph OSD Journal - Results

Seq Read operations

Page 37: Ceph barcelona-v-1.2

Ceph OSD Journal - Results

Read operations