marian-marinov
Backing up thousands of containers
OR
How to fail miserably at copying data
OpenFest 2015
Why talk about backup systems?
➢First backup system built in 1999
➢Since then, 10 different systems
➢But why build your own?
  ➢Simple: SCALE
➢I'm very proud of the design of the last two systems my team and I built
Backup considerations
➢Storage capacity
➢Number of backup copies
➢HDD and RAID speeds
➢Almost never the network
Networking...
➢typical transfer speed over 1Gbit/s ~ 24MB/s
➢typical transfer speed over 10Gbit/s ~ 110MB/s
➢Restoring an 80% full 2TB drive
  ➢~21h over 1Gbit/s at 24MB/s
  ➢~4.5h over 10Gbit/s at 110MB/s
➢Overlapping backups on the same network equipment
➢Overlapping backups and restores
➢Switch uplinks
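The restore estimates above come down to dividing the data to move by the sustained throughput. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and ignoring protocol and filesystem overhead; the function name `restore_hours` is hypothetical. Under those assumptions it yields roughly 18.5 h and 4 h, somewhat below the slide's ~21 h and ~4.5 h, so the slide's figures presumably include some real-world overhead.

```python
"""Back-of-the-envelope restore time: data to move divided by sustained throughput."""


def restore_hours(drive_tb: float, fill_ratio: float, throughput_mb_s: float) -> float:
    """Hours needed to copy the used part of a drive at a sustained rate.

    Assumes decimal units (1 TB = 1,000,000 MB) and no protocol or
    filesystem overhead, so real restores take somewhat longer.
    """
    used_mb = drive_tb * fill_ratio * 1_000_000
    seconds = used_mb / throughput_mb_s
    return seconds / 3600


if __name__ == "__main__":
    for label, rate in (("1Gbit/s", 24), ("10Gbit/s", 110)):
        print(f"{label}: {restore_hours(2, 0.8, rate):.1f} h")
```

Since the time scales linearly with the data size, the same function also shows how much restoring smaller units (for example a single account instead of a whole drive) can save.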
Architecture of container backups
➢Designed for 100,000 containers
➢back up each container at least once a day
➢30 incremental copies
➢Now I'll explain HOW :)
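To get a feel for what these targets imply, here is a rough sizing calculation. It assumes exactly one backup per container per day and counts one base LV plus one snapshot LV per retained copy; that count is an interpretation of the backup node design described below, not a figure from the talk.

```python
"""Rough sizing for the stated design targets (100,000 containers, daily, 30 copies)."""

CONTAINERS = 100_000
BACKUPS_PER_DAY = 1       # "at least once a day"; assume exactly one
INCREMENTAL_COPIES = 30   # snapshots kept per container

backups_per_hour = CONTAINERS * BACKUPS_PER_DAY / 24
backups_per_second = backups_per_hour / 3600
# Assumption: one base LV per container plus one snapshot LV per retained copy.
lvs_total = CONTAINERS * (1 + INCREMENTAL_COPIES)

print(f"{backups_per_hour:.0f} backups/hour, {backups_per_second:.2f} backups/s")
print(f"{lvs_total:,} logical volumes across all backup nodes")
```

More than one backup starts every second around the clock, and millions of LVs have to be spread over the backup nodes, which is why per-VG limits matter later in the talk.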
Host machine architecture
➢We use LVM
➢RAID array which exposes a single drive
➢set up a single Physical Volume on that drive
➢set up a single Volume Group using the above PV
➢Thin-provisioned VG
➢Each container with its own Logical Volume
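The host layout above can be expressed as a handful of LVM commands. This sketch only builds the argument lists and does not run them; the device `/dev/sdb`, the names `vg0`, `pool`, `ct101`/`ct102`, the virtual sizes and the 90% pool size are all hypothetical. The thin-pool and thin-LV syntax follows lvmthin(7).

```python
"""Build (but do not run) the LVM commands for the host layout described above."""

import shlex


def host_layout_commands(device: str, vg: str, pool: str,
                         containers: dict[str, str]) -> list[list[str]]:
    """Return argv lists: one PV, one VG, one thin pool, one thin LV per container."""
    cmds = [
        ["pvcreate", device],        # single PV on the RAID device
        ["vgcreate", vg, device],    # single VG on that PV
        # Thin pool using most of the VG; the free extents left over
        # allow the pool to be extended later (hypothetical 90%).
        ["lvcreate", "--type", "thin-pool", "-l", "90%FREE", "-n", pool, vg],
    ]
    for name, virtual_size in containers.items():
        # One thin LV per container; blocks are allocated from the pool on demand.
        cmds.append(["lvcreate", "-n", name, "-V", virtual_size,
                     "--thinpool", pool, vg])
    return cmds


if __name__ == "__main__":
    layout = host_layout_commands("/dev/sdb", "vg0", "pool",
                                  {"ct101": "20G", "ct102": "50G"})
    for argv in layout:
        print(shlex.join(argv))
```

Because the container LVs are thin, their virtual sizes can add up to more than the pool, which is what makes packing many containers onto one host practical.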
Backup node architecture
➢Again we use LVM
➢RAID array which exposes a single drive
➢5 equally big Physical Volumes
➢on each PV we create a VG with a thin pool
➢each container has a single LV
➢each incremental backup is a new snapshot of the LV
➢when the max number of incremental backups is reached, we remove the oldest snapshot LV
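A sketch of one rotation step on a backup node, assuming each container's snapshots are tracked oldest first and that the retention limit is the 30 copies mentioned earlier. The snapshot naming (`ct101-0`, `ct101-1`, ...) and the function name are hypothetical. Per lvmthin(7), `lvcreate -s` without a size creates a thin snapshot when the origin is a thin LV.

```python
"""Plan one incremental backup step: snapshot the container LV, prune the oldest."""


def rotation_commands(vg: str, lv: str, existing: list[str],
                      new_snap: str, max_copies: int = 30) -> list[list[str]]:
    """Return argv lists; `existing` holds this container's snapshot LVs, oldest first."""
    # Thin snapshot: no size needed, it shares blocks with its origin in the pool.
    cmds = [["lvcreate", "-s", "-n", new_snap, f"{vg}/{lv}"]]
    snapshots = existing + [new_snap]
    # Remove the oldest snapshots until we are back at the retention limit.
    excess = len(snapshots) - max_copies
    for old in snapshots[:max(excess, 0)]:
        cmds.append(["lvremove", "-f", f"{vg}/{old}"])
    return cmds
```

Creating the new snapshot before removing the oldest one means there are never fewer than `max_copies` restore points, at the cost of briefly holding one extra snapshot.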
For now, there is nothing really new or very interesting here.
So let me start with the fun part.
➢We use rsync (nothing revolutionary here)
➢We need the size of the deleted files
  ➢https://github.com/kyupltd/rsync/tree/deleted-stats
➢Restore files directly into clients' containers, without SSHing into them
  ➢https://github.com/kyupltd/rsync/tree/mount-ns
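The talk does not describe how the mount-ns branch works. One kernel mechanism that makes this kind of restore possible is setns(2) on `/proc/<pid>/ns/mnt`; the sketch below shows that mechanism in Python, not the actual rsync patch. It assumes Linux, root privileges and Python 3.12+ (for `os.setns`); the function names are hypothetical. The source file is opened before switching namespaces because host paths are no longer visible afterwards.

```python
"""Write a file into a running container's filesystem by joining its mount namespace."""

import os
import shutil


def mount_ns_path(pid: int) -> str:
    """Path of the mount-namespace handle of process `pid`."""
    return f"/proc/{pid}/ns/mnt"


def restore_file_into_container(container_pid: int, src: str, dst: str) -> bool:
    """Copy host file `src` to `dst` as seen inside the container.

    Needs root (CAP_SYS_ADMIN and CAP_SYS_CHROOT), Linux and Python 3.12+.
    Joining a mount namespace cannot be undone and is refused for threads
    that share filesystem state, so the work happens in a forked child.
    """
    child = os.fork()
    if child == 0:
        try:
            # Open the source while the host's filesystem is still visible.
            with open(src, "rb") as fin:
                ns_fd = os.open(mount_ns_path(container_pid), os.O_RDONLY)
                try:
                    os.setns(ns_fd, os.CLONE_NEWNS)
                finally:
                    os.close(ns_fd)
                # Root and cwd now point at the container's mount namespace root.
                with open(dst, "wb") as fout:
                    shutil.copyfileobj(fin, fout)
            os._exit(0)
        except BaseException:
            os._exit(1)
    _, status = os.waitpid(child, 0)
    return os.waitstatus_to_exitcode(status) == 0
```

Compared with SSHing into the container, this needs no credentials or daemon inside the container; the trade-off is that the restoring process runs as root on the host.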
Backup system architecture
➢ One central database
  ➢ Public/Private IP addresses
  ➢ Maximum slots per machine
➢ Gearman as the messaging layer
➢ Scheduler for backups
➢ Backup worker
The Scheduler
➢ Check if we have to back up the container
➢ Get the last backup timestamp
➢ Check if the host node has available backup slots
➢ Submit a 'start-backup' job to Gearman on the backup node
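A minimal sketch of one scheduler pass, with the central database modelled as in-memory dataclasses and the Gearman submission replaced by a plain `submit` callback (no real Gearman client API is used here). The field names, the 24-hour interval check, the oldest-first ordering and passing the host's private IP in the payload are assumptions.

```python
"""One scheduler pass; the central database is modelled as in-memory records."""

from dataclasses import dataclass
from typing import Callable, Optional

BACKUP_INTERVAL = 24 * 3600  # back up each container at least once a day


@dataclass
class HostNode:
    name: str
    public_ip: str
    private_ip: str
    max_slots: int      # maximum concurrent backups this host may serve
    running: int = 0    # backups currently in progress


@dataclass
class Container:
    name: str
    host: str                     # host node the container lives on
    backup_node: str              # backup node that stores its copies
    last_backup: Optional[float]  # Unix timestamp, None if never backed up


def schedule(containers: list[Container], hosts: dict[str, HostNode], now: float,
             submit: Callable[[str, str, dict], None]) -> list[str]:
    """Queue 'start-backup' jobs; `submit(node, task, payload)` stands in for Gearman."""
    queued = []
    # Never-backed-up containers first, then the oldest backups.
    for ct in sorted(containers,
                     key=lambda c: -1.0 if c.last_backup is None else c.last_backup):
        if ct.last_backup is not None and now - ct.last_backup < BACKUP_INTERVAL:
            continue  # backed up recently enough
        host = hosts[ct.host]
        if host.running >= host.max_slots:
            continue  # no free slot on this host; retry on the next pass
        host.running += 1
        submit(ct.backup_node, "start-backup",
               {"container": ct.name, "host_ip": host.private_ip})
        queued.append(ct.name)
    return queued
```

In this model the worker would decrement the host's `running` counter as part of its final database update, freeing the slot for the next pass.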
start-backup worker
➢ Runs on each backup node
➢ Runs as many instances as the backup server can handle
➢ Handles the actual backup
  ➢ creates snapshots
  ➢ monitors rsync
  ➢ removes snapshots
  ➢ updates the database
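The worker's steps can be sketched as a function that guarantees cleanup: whatever happens during rsync, the snapshot is removed and the result recorded. The four callables are hypothetical stand-ins for the real snapshot, rsync and database code. Treating rsync exit code 24 ("partial transfer due to vanished source files") as success is an assumption that suits live containers, not something the talk states.

```python
"""Skeleton of a start-backup worker: snapshot, rsync, always clean up, record result."""

from typing import Callable

# Assumption: exit code 24 (source files vanished mid-transfer) is normal
# for a live container and is treated as success.
RSYNC_OK = {0, 24}


def run_backup(container: str,
               create_snapshot: Callable[[str], str],
               run_rsync: Callable[[str], int],
               remove_snapshot: Callable[[str], None],
               update_db: Callable[[str, bool], None]) -> bool:
    snap = create_snapshot(container)      # consistent point-in-time view
    ok = False
    try:
        ok = run_rsync(snap) in RSYNC_OK   # copy from the snapshot, check exit code
    finally:
        try:
            remove_snapshot(snap)          # never leave snapshots behind
        finally:
            update_db(container, ok)       # record the outcome, free the slot
    return ok
```

Leaked snapshots on a thin pool keep consuming space, so the nested `finally` blocks matter more than the happy path.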
No problems... they say :)
➢ We lost ALL of our backups from TWO nodes
  ➢ corrupted VG metadata
➢ VG metadata space is not enough for many LVs (more than 2000)
  ➢ create the VGs a little bit smaller than the total size of the PV
  ➢ separate the VGs to lose less
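Two standard LVM tools complement the fixes above and are shown here as an alternative, not as the talk's procedure: a larger metadata area per PV via `pvcreate --metadatasize`, and copies of each VG's metadata kept outside the VG via `vgcfgbackup`, restorable later with `vgcfgrestore`. The sketch only builds the commands; the 64m size, the five partition names and the backup directory are hypothetical.

```python
"""Build LVM commands that add metadata headroom and keep metadata backups."""


def pv_commands(devices: list[str], metadata_size: str = "64m") -> list[list[str]]:
    # VG metadata grows with the number of LVs, so reserve a larger area per PV.
    return [["pvcreate", "--metadatasize", metadata_size, dev] for dev in devices]


def metadata_backup_commands(vgs: list[str], backup_dir: str) -> list[list[str]]:
    # A copy of each VG's metadata outside the VG lets a corrupted VG be
    # restored with vgcfgrestore instead of losing every backup on it.
    return [["vgcfgbackup", "-f", f"{backup_dir}/{vg}.vg", vg] for vg in vgs]


if __name__ == "__main__":
    partitions = [f"/dev/sdb{i}" for i in range(1, 6)]  # the 5 PVs
    for argv in pv_commands(partitions):
        print(" ".join(argv))
    for argv in metadata_backup_commands([f"vg{i}" for i in range(1, 6)], "/root/lvm-meta"):
        print(" ".join(argv))
```

Running the backup commands after each batch of snapshot changes, and shipping the files off the node, keeps a recent copy available if the on-disk metadata gets corrupted.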
No problems... they say :)
➢ LV creation becomes sluggish because LVM tries to scan for devices in /dev
  ➢ obtain_device_list_from_udev = 1
  ➢ write_cache_state = 0
  ➢ specify the devices in scan = [ "/dev" ]
➢lvmetad and dmeventd break...
  ➢ when they break, they corrupt the metadata of all currently opened containers
➢lvcreate leaks file descriptors
  ➢ once lvmetad or dmeventd run out of FDs, everything breaks
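Since running out of file descriptors in lvmetad or dmeventd is what takes everything down, it is worth watching their FD usage. A sketch, assuming Linux `/proc` and root access to other processes' entries; the function names and the 80% warning threshold are hypothetical.

```python
"""Warn before a daemon such as dmeventd or lvmetad runs out of file descriptors."""

import os
import sys


def open_fds(pid: int) -> int:
    """Number of file descriptors currently open in process `pid`."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_soft_limit(pid: int) -> int:
    """Soft 'Max open files' limit of process `pid`, read from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> files
                return int(line.split()[3])
    raise RuntimeError("no 'Max open files' line found")


def fd_usage(pid: int) -> float:
    """Fraction of the soft FD limit currently in use."""
    return open_fds(pid) / fd_soft_limit(pid)


if __name__ == "__main__":
    for pid in map(int, sys.argv[1:]):  # e.g. the PIDs of dmeventd and lvmetad
        usage = fd_usage(pid)
        status = "WARNING" if usage > 0.8 else "ok"
        print(f"{pid}: {usage:.0%} of FD limit in use ({status})")
```

A steadily rising count between checks is the signature of the descriptor leak described above, well before the limit is actually hit.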
Then the Avatar came
➢ We wanted to reduce the restore time from 4h to under 1h, even under 30min
➢ So instead of backing up whole containers...
➢ We now back up accounts
➢ Soon we will be able to do distributed restores
  ➢ of a single host node's backup
  ➢ from multiple backup nodes
  ➢ to multiple host nodes
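The talk gives no details of how the distributed restore will divide the work. As an illustration of why it helps, this sketch swaps in a standard greedy heuristic (largest accounts first, each to the least-loaded stream) to spread accounts over parallel backup-node-to-host-node streams; the restore time is then bounded by the busiest stream rather than by the total. Account names, sizes and the stream model are hypothetical.

```python
"""Why restoring per account in parallel shortens the restore (illustrative only)."""

import heapq


def plan_parallel_restore(account_sizes_gb: dict[str, float],
                          streams: int) -> tuple[list[list[str]], float]:
    """Greedy largest-first assignment of accounts to parallel restore streams.

    Each stream stands for one (backup node -> host node) pair. Returns the
    accounts per stream and the largest per-stream load, which bounds the
    wall-clock time when all streams run at the same speed.
    """
    heap = [(0.0, i) for i in range(streams)]  # (load in GB, stream index)
    plan: list[list[str]] = [[] for _ in range(streams)]
    for name, size in sorted(account_sizes_gb.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)          # least-loaded stream so far
        plan[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return plan, max(load for load, _ in heap)
```

For example, accounts of 400, 300, 300, 200 and 100 GB over two streams put at most 700 GB on either stream instead of 1300 GB on one.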
Layered backups
➢ Sparse File
➢ Physical Volume
➢ Volume Group
➢ ThinPool
➢ Logical Volume
➢ Snapshots 0 to 6
➢ Loop mount
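One reading of the layer stack above, stated as an assumption: a sparse file is attached to a loop device, which becomes the PV carrying the VG, thin pool, LV and snapshots. The sketch creates the sparse file, reports how much of it is actually allocated (useful since removed snapshots are not discarded), and lists the commands for the lower layers. The paths, `/dev/loop0`, `vgb`, `pool` and the 90% pool size are hypothetical.

```python
"""Create a sparse backing file and list the commands for the layers above it."""

import os


def create_sparse_file(path: str, size_bytes: int) -> None:
    # Extending with truncate() writes no data, so no blocks are allocated yet.
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def allocated_bytes(path: str) -> int:
    # st_blocks counts 512-byte units actually allocated on disk.
    return os.stat(path).st_blocks * 512


def stack_commands(path: str, loopdev: str, vg: str, pool: str) -> list[list[str]]:
    return [
        ["losetup", loopdev, path],   # sparse file -> block device
        ["pvcreate", loopdev],        # block device -> PV
        ["vgcreate", vg, loopdev],    # PV -> VG
        ["lvcreate", "--type", "thin-pool", "-l", "90%FREE", "-n", pool, vg],
    ]
```

Comparing `allocated_bytes()` with `os.path.getsize()` over time shows how far the backing file has grown toward its apparent size, even after old snapshots are removed.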
Issues here
➢ We can't keep a machine UP for more than 19 hours because of an LVM kernel bug
  ➢ kernels 2.6 to 4.3 crash when discarding data
➢ Removing old snapshots does not discard the data
➢ LVM unmounts a volume when dmeventd reaches its FD limit
  ➢ It does umount -l, the bastard
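Because a lazy unmount leaves the directory in place, a backup writer can keep going without noticing and fill the parent filesystem instead. A simple guard, sketched here as an assumption rather than the talk's fix, is to verify that each target is still a mount point before and after every rsync run; the function name `check_mounted` is hypothetical.

```python
"""Detect backup targets that are no longer mount points (e.g. after umount -l)."""

import os


def check_mounted(paths: list[str]) -> list[str]:
    """Return the paths that are *not* currently mount points.

    After a lazy unmount the directory still exists but is no longer a
    mount point, so new writes would silently land on the parent filesystem.
    """
    return [p for p in paths if not os.path.ismount(p)]
```

A worker can abort and mark the backup as failed whenever this list is non-empty, instead of reporting a successful copy that went to the wrong place.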
Issues here
➢ LVM dmeventd tries to extend the volume, but if you don't have free extents it will silently umount -l your LV
➢ Monitor your thinpool metadata
➢ Make your thinpool smaller than the VG and always plan to have a few spare PEs for extending the pool
➢ kabbi__ irc.freenode.net #lvm
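Monitoring the thin pool can be as simple as polling `lvs`. This sketch parses the output of `lvs --noheadings --separator '|' -o vg_name,lv_name,data_percent,metadata_percent` and flags pools above a threshold; the 80% threshold and the function names are hypothetical. Automatic extension by dmeventd is governed by `thin_pool_autoextend_threshold` and `thin_pool_autoextend_percent` in the `activation` section of lvm.conf, and it only works if the VG still has free extents, hence the advice to keep spare PEs.

```python
"""Parse `lvs` output and flag thin pools whose data or metadata is nearly full."""

import subprocess

LVS_CMD = ["lvs", "--noheadings", "--separator", "|",
           "-o", "vg_name,lv_name,data_percent,metadata_percent"]


def read_lvs() -> str:
    """Run lvs on this machine (requires LVM tools and root)."""
    return subprocess.run(LVS_CMD, capture_output=True, text=True, check=True).stdout


def parse_lvs(output: str) -> list[tuple[str, str, float, float]]:
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        vg, lv, data, meta = (field.strip() for field in line.split("|"))
        if not meta:  # only thin pools report metadata usage
            continue
        rows.append((vg, lv, float(data), float(meta)))
    return rows


def nearly_full(rows: list[tuple[str, str, float, float]],
                threshold: float = 80.0) -> list[str]:
    return [f"{vg}/{lv}" for vg, lv, data, meta in rows
            if data >= threshold or meta >= threshold]
```

Calling `nearly_full(parse_lvs(read_lvs()))` from a cron job or monitoring agent gives early warning well before dmeventd reaches the point of unmounting anything.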
Any Questions?