
Page 1: Backing up thousands of containers

Backing up thousands of containers

OR How to fail miserably at copying data

OpenFest 2015

Page 2: Backing up thousands of containers
Page 3: Backing up thousands of containers

Talk about backup systems... Why?

➢First backup system built in 1999

➢Since then, 10 different systems

➢But why build your own?
  ➢ simple: SCALE

➢I'm very proud of the design of the last two systems my team and I built

Page 4: Backing up thousands of containers

Backup considerations

➢Storage capacity

➢Number of backup copies

➢HDD and RAID speeds

➢Almost never the network

Page 5: Backing up thousands of containers

Networking...

➢typical transfer speed over 1Gbit/s: ~24MB/s

➢typical transfer speed over 10Gbit/s: ~110MB/s

➢Restoring an 80% full 2TB drive
  ➢ ~21h over 1Gbit/s at 24MB/s
  ➢ ~4.5h over 10Gbit/s at 110MB/s

➢Overlapping backups on the same network equipment

➢Overlapping backups and restores

➢Switch uplinks
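
For reference, a quick back-of-the-envelope check of those restore times (a minimal sketch; it assumes the drive size is in binary terabytes and the speeds in decimal MB/s, which roughly reproduces the figures above):

    # Back-of-the-envelope check of the restore times quoted above.
    def restore_hours(used_bytes, speed_bytes_per_s):
        """Hours needed to copy used_bytes at a sustained transfer speed."""
        return used_bytes / speed_bytes_per_s / 3600.0

    used = 0.8 * 2 * 1024**4                       # an 80% full 2TB drive
    print(round(restore_hours(used, 24e6), 1))     # ~20.4h over 1Gbit/s at ~24MB/s
    print(round(restore_hours(used, 110e6), 1))    # ~4.4h over 10Gbit/s at ~110MB/s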

Page 6: Backing up thousands of containers

Architecture of container backups

➢Designed for 100,000 containers

➢back up each container at least once a day

➢30 incremental copies

➢Now I'll explain HOW :)
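
To get a feel for the scale, a rough sizing sketch (the 5-minute average backup duration is a made-up number for illustration, not from the talk):

    containers = 100_000
    day_seconds = 24 * 3600
    avg_backup_seconds = 5 * 60                    # hypothetical average incremental backup

    start_rate = containers / day_seconds          # ~1.16 backups started every second
    concurrent = start_rate * avg_backup_seconds   # ~347 backups running at any moment
    print(round(start_rate, 2), round(concurrent))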

Page 7: Backing up thousands of containers

Host machine architecture

➢We use LVM

➢RAID array which exposes a single drive

➢set up a single Physical Volume on that drive

➢set up a single Volume Group using the above PV

➢Thin-provisioned VG

➢Each container with its own Logical Volume
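
A minimal sketch of that layout with stock LVM commands driven from Python (the device name, sizes and the vg0/thinpool naming are assumptions for illustration, not the real setup):

    import subprocess

    def run(*cmd):
        """Run a command and fail loudly if it errors."""
        subprocess.run(cmd, check=True)

    raid_device = "/dev/sdb"                # the single drive exposed by the RAID array

    run("pvcreate", raid_device)            # one Physical Volume on that drive
    run("vgcreate", "vg0", raid_device)     # one Volume Group on that PV
    run("lvcreate", "-l", "95%VG", "-T", "vg0/thinpool")   # thin pool, leaving spare extents

    # One thin Logical Volume per container; space is only allocated as data is written.
    run("lvcreate", "-V", "20G", "-T", "vg0/thinpool", "-n", "container1234")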

Page 8: Backing up thousands of containers

Backup node architecture

➢Again we use LVM

➢RAID array which exposes a single drive

➢5 equally big Physical Volumes

➢on each PV we create a VG with thin pool

➢each container has a single LV

➢each incremental backup is a new snapshot from the LV

➢when the max number of incremental backups is reached, we remove the first (oldest) snapshot (see the rotation sketch below)
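
A sketch of that snapshot rotation with stock LVM commands (volume names, the snapshot naming scheme and the use of lvs to list snapshots are assumptions; the real tooling is not shown in the talk):

    import subprocess
    import time

    MAX_INCREMENTALS = 30

    def run(*cmd, capture=False):
        res = subprocess.run(cmd, check=True, capture_output=capture, text=True)
        return res.stdout if capture else None

    def rotate_snapshots(vg, container_lv):
        """Take a new incremental snapshot and drop the oldest one over the limit."""
        # Existing snapshots of this container's thin LV, oldest first.
        out = run("lvs", "--noheadings", "-o", "lv_name", "--sort", "lv_time",
                  "--select", f"origin={container_lv}", vg, capture=True)
        snapshots = [line.strip() for line in out.splitlines() if line.strip()]

        new_snap = f"{container_lv}_snap_{int(time.time())}"     # naming is an assumption
        run("lvcreate", "-s", "-n", new_snap, f"{vg}/{container_lv}")

        if len(snapshots) + 1 > MAX_INCREMENTALS:
            run("lvremove", "-y", f"{vg}/{snapshots[0]}")        # drop the oldest snapshot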

Page 9: Backing up thousands of containers

For now, there is nothing really new or very interesting here.

So let me start with the fun part.

Page 10: Backing up thousands of containers

➢We use rsync (nothing revolutionary here)

➢We need the size of the deleted files
  ➢ https://github.com/kyupltd/rsync/tree/deleted-stats

➢Restore files directly into clients' containers, no SSH into them
  ➢ https://github.com/kyupltd/rsync/tree/mount-ns
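
For illustration, a stock rsync invocation of the kind used for each container copy (only standard rsync flags are shown; the deleted-files accounting and the mount-namespace restore come from the patched branches linked above and are not shown here):

    import subprocess

    def backup_container(host, container_root, dest_dir):
        """Pull one container's filesystem into its backup volume with stock rsync."""
        subprocess.run([
            "rsync",
            "-aHAX",            # archive mode plus hard links, ACLs and xattrs
            "--numeric-ids",    # keep uids/gids as numbers, don't remap them
            "--delete",         # mirror deletions so the copy matches the source
            f"root@{host}:{container_root}/",
            f"{dest_dir}/",
        ], check=True)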

Page 11: Backing up thousands of containers

Backup system architecture

➢ One central database
  ➢ Public/Private IP addresses
  ➢ Maximum slots per machine

➢ Gearman for messaging layer

➢ Scheduler for backups

➢ Backup worker
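
The talk doesn't show the actual schema, but a hypothetical illustration of what the central database tracks per machine and per container could look like this:

    import sqlite3

    db = sqlite3.connect("backup.db")        # the real database engine isn't specified
    db.executescript("""
    CREATE TABLE IF NOT EXISTS nodes (
        hostname    TEXT PRIMARY KEY,
        public_ip   TEXT NOT NULL,
        private_ip  TEXT NOT NULL,
        max_slots   INTEGER NOT NULL,        -- maximum concurrent backups for this machine
        used_slots  INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE IF NOT EXISTS backups (
        container   TEXT PRIMARY KEY,
        host_node   TEXT NOT NULL REFERENCES nodes(hostname),
        backup_node TEXT NOT NULL REFERENCES nodes(hostname),
        finished_at INTEGER                  -- unix timestamp of the last successful backup
    );
    """)
    db.commit()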

Page 12: Backing up thousands of containers

The Scheduler

➢ Check if we have to back up the container

➢ Get the last backup timestamp

➢ Check if the host node has available backup slots

➢ Schedule a 'start-backup' job via Gearman on the backup node (see the sketch after this list)
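
A condensed sketch of that loop against the hypothetical schema above (submit_job stands in for the Gearman client call that queues the 'start-backup' job on the backup node; the real scheduler code is not shown in the talk):

    import time

    BACKUP_INTERVAL = 24 * 3600            # back up each container at least once a day

    def schedule_backups(db, submit_job):
        """One scheduler pass: queue a backup for every container that needs one."""
        now = int(time.time())
        rows = db.execute(
            "SELECT container, host_node, backup_node, finished_at FROM backups").fetchall()
        for container, host_node, backup_node, finished_at in rows:
            if finished_at and now - finished_at < BACKUP_INTERVAL:
                continue                               # backed up recently enough
            free = db.execute(
                "SELECT max_slots - used_slots FROM nodes WHERE hostname = ?",
                (host_node,)).fetchone()
            if not free or free[0] <= 0:
                continue                               # host node has no free backup slots
            submit_job("start-backup", {"container": container,
                                        "host_node": host_node,
                                        "backup_node": backup_node})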

Page 13: Backing up thousands of containers

start-backup worker

➢ Works on each backup node

➢ Started as many times as the Backup server can handle

➢ handles the actual backup (see the worker sketch after this list)
  ➢ creates snapshots
  ➢ monitors rsync
  ➢ removes snapshots
  ➢ updates the database
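
Putting it together, a sketch of what one start-backup worker does per job, reusing the rotate_snapshots and backup_container sketches above (paths and the ordering of steps are simplified assumptions; mounting the backup LV and the Gearman worker registration are omitted):

    import time

    def handle_start_backup(db, job):
        """Process one 'start-backup' job on a backup node."""
        container = job["container"]
        host_node = job["host_node"]

        # 1. Rotate snapshots: create a new one, drop the oldest past 30 copies.
        rotate_snapshots("vg0", container)

        # 2. Run and monitor rsync from the host node into the backup volume.
        backup_container(host_node, f"/var/lib/containers/{container}",   # path is an assumption
                         f"/backup/{container}")                          # path is an assumption

        # 3. Record the successful backup in the central database.
        db.execute("UPDATE backups SET finished_at = ? WHERE container = ?",
                   (int(time.time()), container))
        db.commit()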

Page 14: Backing up thousands of containers

No problems... they say :)

➢ We lost ALL of our backups from TWO nodes
  ➢ corrupted VG metadata

➢ VG metadata space is not enough once you have more than 2000 LVs
  ➢ create the VGs a little bit smaller than the total size of the PV

➢ separate the VGs to lose less

Page 15: Backing up thousands of containers

No problems... they say :)

➢ LV creation becomes sluggish because LVM tries to scan for devices in /dev (see the lvm.conf excerpt below)
  ➢ obtain_device_list_from_udev = 1
  ➢ write_cache_state = 0
  ➢ specify the devices in scan = [ "/dev" ]

➢ lvmetad and dmeventd break...
  ➢ when they break, they corrupt the metadata of all currently open containers

➢ lvcreate leaks file descriptors
  ➢ once lvmetad or dmeventd are out of FDs, everything breaks
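
The three settings above go in the devices section of /etc/lvm/lvm.conf; a minimal excerpt showing only the options named in the talk:

    # /etc/lvm/lvm.conf
    devices {
        obtain_device_list_from_udev = 1    # get the device list from udev instead of scanning
        write_cache_state = 0               # don't persist the device cache to disk
        scan = [ "/dev" ]                   # directories LVM is allowed to scan
    }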

Page 16: Backing up thousands of containers

Then the Avatar came

➢ We wanted to reduce the restore time from 4h to under 1h, even under 30min

➢ So instead of backing up whole containers...

➢ We now backup accounts

➢ Soon we will be able to do distributed restore
  ➢ single host node backup
  ➢ from multiple backup nodes
  ➢ to multiple host nodes

Page 17: Backing up thousands of containers

Layered backups

➢ Layer stack: sparse file → loop mount → Physical Volume → Volume Group → Thin Pool → Logical Volume → Snapshot0 ... Snapshot6
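
A minimal sketch of building that stack bottom-up (file name, sizes and volume names are illustrative assumptions):

    import subprocess

    def run(*cmd, capture=False):
        res = subprocess.run(cmd, check=True, capture_output=capture, text=True)
        return res.stdout.strip() if capture else None

    # Sparse backing file for one account's backups (name and size are assumptions).
    run("truncate", "-s", "50G", "/backup/account1234.img")

    # Attach it to a free loop device and build the LVM stack on top.
    loop_dev = run("losetup", "--find", "--show", "/backup/account1234.img", capture=True)
    run("pvcreate", loop_dev)
    run("vgcreate", "vg_account1234", loop_dev)
    run("lvcreate", "-l", "90%VG", "-T", "vg_account1234/pool")
    run("lvcreate", "-V", "45G", "-T", "vg_account1234/pool", "-n", "data")

    # Each incremental backup is then a thin snapshot of vg_account1234/data
    # (Snapshot0 ... Snapshot6 in the diagram above).
    run("lvcreate", "-s", "-n", "snapshot0", "vg_account1234/data")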

Page 18: Backing up thousands of containers

Issues here

➢ We can't keep a machine up for more than 19 hours, LVM kernel BUG
  ➢ kernels 2.6 till 4.3 - when discarding data it crashes

➢ Removing old snapshots does not discard the data

➢ LVM unmounts a volume when dmeventd reaches its limit of FDs
  ➢ it does umount -l, the bastard

Page 19: Backing up thousands of containers

Issues here

➢ LVM's dmeventd tries to extend the volume, but if you don't have free extents it will silently umount -l your LV

➢ Monitor your thinpool metadata usage (see the sketch after this list)

➢ Make your thinpool smaller than the VG and always plan to have a few spare PEs for extending the pool

➢ kabbi__ irc.freenode.net #lvm
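
A minimal way to watch thin pool data and metadata usage with lvs (volume names and the 90% thresholds are arbitrary examples):

    import subprocess

    def pool_usage(vg, pool):
        """Return (data_percent, metadata_percent) for a thin pool via lvs."""
        out = subprocess.run(
            ["lvs", "--noheadings", "--separator", ",",
             "-o", "data_percent,metadata_percent", f"{vg}/{pool}"],
            check=True, capture_output=True, text=True).stdout
        data, meta = (float(x) for x in out.strip().split(","))
        return data, meta

    data, meta = pool_usage("vg0", "thinpool")          # names are assumptions
    if data > 90 or meta > 90:                          # arbitrary example thresholds
        print(f"WARNING: thin pool at {data}% data / {meta}% metadata usage")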

Page 20: Backing up thousands of containers

Any Questions?

Page 21: Backing up thousands of containers