Xen Virtualization at Huawei: Usages, Value-adds, and Challenges
Liu Jinsong, Chief Architect for Virtualization, Huawei
liu.jinsong@huawei.com
Usages
• Huawei’s Unified Virtual Platform (UVP) supports Huawei’s all-cloud:
  • Public cloud
  • Private cloud
  • NFV
• For all-cloud, UVP enables dozens of features based on Xen/KVM:
  • VM life-cycle management
  • VM management
  • Storage support: vims, dsware, b-cache, NVMe SSD
  • Network support: OVS + DPDK, smart NIC, VLAN, VXLAN
  • Live migration and DRS
  • GPU graphics virtualization and virtual desktop
  • GPGPU computing and FPGA virtualization
  • ARM virtualization
Usages
• VM life-cycle management
  • Co-working with Huawei’s OpenStack
  • Image and VM configuration (password, static IP, etc.)
  • VM start / suspend / resume / reboot / shutdown / destroy
  • VM snapshot
  • VM sanity check
  • HA / FT
• VM management
  • CPU hotplug
  • Memory over-commitment
  • USB, CD-ROM, VNC, SCSI support
  • Pass-through / SR-IOV
  • Micro-VM
Usages
• Storage
  • NVMe SSD
    • To satisfy extreme IOPS and latency requirements
  • vims
    • vSAN, OCFS2 enhancement supporting 64 nodes, used in Huawei’s private cloud
  • dsware
    • GFS/HDFS-like storage, supporting hundreds of nodes in 1 POD, used in public cloud
  • b-cache
    • Read I/O cache, 10x IOPS speedup
• Network
  • OVS + DPDK
  • Smart NIC
    • Offloading via Huawei’s ARM + FPGA smart NIC
Usages
• GPU graphics virtualization and virtual desktop
  • 4 GPU graphics virtualization technologies
  • Huawei uses 2 of them, providing AWS G2-like instances
    • API forwarding
      • Compatibility issues for Windows guests
    • vGPU solution
      • Nvidia K1/K2/M60 + XenGT
  • GPU SR-IOV is under evaluation
  • Huawei’s desktop protocol
    • HDP (Huawei Desktop Protocol) virtual desktop
Usages
• GPGPU virtualization
  • Supports HPC / AI cloud, providing AWS P2-like instances
  • GPU pass-through
    • GPGPU capability is OK compared with native
    • GPU-to-GPU data transfer bottleneck
    • Nvidia GPUDirect (P2P) virtualization
• FPGA virtualization
  • Not friendly to virtualization
  • Security issues
Value-add: live migration
• Live migration @ virtualization
  • Zero-page scanning
    • Reduced scanning frequency
  • TSC scaling
  • Guest whitelist
• Live migration @ cloud
  • Event-based, on xenstore
  • Parallel migration
  • Closed-loop control
  • Safe roll-back
    • ~100% of VMs stay alive when migration fails
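The slides do not describe how zero-page scanning is implemented. As a generic illustration only, the sketch below shows the idea in a pre-copy memory iteration: all-zero pages are sent as a marker instead of a full payload. The page size, the function names `encode_pages` and `decode_pages`, and the record format are assumptions made for this example, not Huawei's or Xen's actual code.

```python
PAGE_SIZE = 4096              # assumed guest page size
ZERO_PAGE = bytes(PAGE_SIZE)  # reference all-zero page


def encode_pages(memory: bytes):
    """Yield (page_index, payload); payload is None for an all-zero page.

    Assumes len(memory) is a multiple of PAGE_SIZE.
    """
    for offset in range(0, len(memory), PAGE_SIZE):
        page = memory[offset:offset + PAGE_SIZE]
        index = offset // PAGE_SIZE
        # A zero page costs one comparison here and no payload on the wire.
        yield index, (None if page == ZERO_PAGE else page)


def decode_pages(records, num_pages: int) -> bytes:
    """Rebuild memory at the destination; skipped pages stay zero-filled."""
    memory = bytearray(num_pages * PAGE_SIZE)
    for index, payload in records:
        if payload is not None:
            start = index * PAGE_SIZE
            memory[start:start + PAGE_SIZE] = payload
    return bytes(memory)
```

Scanning every page on every iteration costs CPU time, which may be why the slide lists a reduced scanning frequency as the optimization; a real implementation would have to decide when a scan is worth doing.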
[Diagram: live migration flow, event-based on xenstore. Labels as extracted from the figure:]
• Lanes: Control system, Network, Storage, Src VM, Dst VM
• Migration pre-check
• Send migrate command
• Re-connect image at dst
• N-1 iterations of memcpy
• Socket connection, VM config
• Last iteration memcpy
• Save VM & qemu context
• Restore memory
• Create VM
• VM suspend
• Qemu save/load
• Restore VM & qemu
• Image disconnect at src
• Destroy PV devices and qemu
• Image connect at dst in r/w mode
• VM unpause / pause
• Success: frontend-backend reconnect
• Send gratuitous ARP at dst
• Flush Windows ARP cache (migration success)
• Fail or timeout
• Tapdisk reopen
• Create vif
• Qemu recreate and load
• VM resume
• Frontend-backend reconnect
• Image connect at src (migration fail)
• Safe roll-back (migration fail)
• VM runs at dst (migration success)
• VM destroy
• ELB session copy
• Last iteration ELB session copy
• Xend
• Flush cache at src
• ELB session cancel (migration fail)
• Judge and split-brain prevention
• VM database update
• Live migration status
• Storaged, Networkd
• Events ①②③
• Libvirtd
• Xenstore
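The fail-or-timeout branch of the migration flow undoes source-side steps so that the VM resumes at the source. A minimal sketch of that pattern follows, assuming each step can be paired with an undo action; the `migrate` function, the step tuples, and the pairing of steps with undo actions are hypothetical and are not taken from the UVP code.

```python
def migrate(steps):
    """Run migration steps in order; on any failure, undo completed steps in reverse.

    steps: list of (name, do, undo) tuples, where do and undo are callables.
    Returns (success, log).
    """
    done = []  # completed steps, kept so they can be rolled back
    log = []
    for name, do, undo in steps:
        try:
            do()
        except Exception:
            log.append(f"fail:{name}")
            # Roll back in reverse order so the source VM ends up running again.
            for done_name, undo_fn in reversed(done):
                undo_fn()
                log.append(f"undo:{done_name}")
            return False, log
        log.append(f"do:{name}")
        done.append((name, undo))
    return True, log
```

For example, a failing "restore VM & qemu" step at the destination would trigger "image connect at src" and then "VM resume", the reverse of "VM suspend" followed by "image disconnect at src". A timeout can be modeled as a step that raises an exception.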
Challenges: GPGPU virtualization
• GPGPU virtualization
  • Trivial loss of GPU computing capability
  • GPU-to-GPU data transfer bottleneck
    • Caffe: cudaMemcpy
    • GPUDirect P2P data path
  • GPUDirect P2P virtualization
    • How Nvidia probes P2P capability is unknown
    • Exposing PCIe topology
    • gpa -> hpa translation
  • Varies with server topology
    • QPI, IOH, PCIe switch layout
    • How to assign a GPU set to a VM
  • Unified GPGPU virtualization framework?
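The question of how to assign a GPU set to a VM depends on the PCIe switch layout. One possible policy, sketched here purely as an assumption, is to prefer free GPUs that sit behind the same PCIe switch so that peer-to-peer traffic stays on the shortest path. The `topology` mapping and the `pick_gpu_set` function are hypothetical names for this illustration; the slides do not state Huawei's policy.

```python
def pick_gpu_set(topology, free, count):
    """Return `count` free GPUs that share one PCIe switch, or None.

    topology: dict mapping switch name -> list of GPU names (assumed layout).
    free: set of GPU names not yet assigned to any VM.
    """
    for switch in sorted(topology):
        candidates = [gpu for gpu in topology[switch] if gpu in free]
        if len(candidates) >= count:
            return candidates[:count]
    # No single switch can satisfy the request; a real scheduler might
    # fall back to a cross-switch set with a slower GPU-to-GPU path.
    return None
```

With two switches holding GPU-0/GPU-1 and GPU-2/GPU-3, and GPU-0 already assigned, a request for two GPUs would return GPU-2 and GPU-3 rather than splitting the set across switches.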
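[Diagram: example server topologies ①, ②, ③ combining CPU, IOH, PCIe switch and GPU, for instance GPU-0 to GPU-3 behind two PCIe switches with CPU-1 and CPU-2. Other labels: "440 or Q35 chipset?", gpa, topology expose, pass-through, hpa, Hypervisor. The figure also carries a √ mark and an X mark whose targets are not recoverable from the extracted text.]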
Challenges: FPGA virtualization
• Not mature / friendly to virtualization
  • Very useful for AI inference
  • FPGA partitioning for multiple tenants
    • Static partitioning is possible but expensive
    • Dynamic partitioning is not currently possible; it depends on Xilinx/Intel tool support
  • Pass-through solution works, but
    • Expensive
    • Security issue if the FPGA is owned by a malicious guest
      • PCIe bandwidth saturation
      • Over-heating
    • Security issue if the bitstream is owned by the host and hijacked
      • Isolation
      • IP and data leakage
Challenges: hot-upgrade and security
• XSA/CVE hot-patch
  • Only ~75% of XSA/CVE security holes can be hot-patched
  • Data structures, newly added functions, boot-stage security holes
  • vmexit handler, inline functions
  • NMI handler
• OS/hypervisor online upgrade
  • Live migration is too heavy
• Security
  • Currently, security in cloud environments focuses on defending against network attacks
  • A malicious guest can work around network defenses and attack the system from inside
    • e.g., QEMU VENOM
  • Intelligent hardware
    • Smart NIC, FPGA, GPU, etc.
Backup
Live migration based on sync-calls
[Diagram: live migration flow based on synchronous calls. Labels as extracted from the figure:]
• Lanes: Control system, Network, Storage, Src VM, Dst VM
• Pre-check
• Send migrate cmd to src
• Image read-only open at dst
• N-1 iterations of memcpy
• Socket connect, VM config
• Send flush cache to src
• Last iteration memcpy
• Save VM & qemu context
• Restore memory
• Create VM
• VM suspend
• Qemu save/load
• Tapdisk pause
• Restore VM & qemu
• Image disconnect at src
• Storage & network disconnect
• Destroy PV devices and qemu
• Tapdisk open
• Image r/w open at dst (migration success)
• VM unpause / pause
• Waiting for handshake
• Success: frontend-backend connection
• Send gratuitous ARP at dst
• Flush Windows ARP cache (migration success)
• Fail or timeout
• Tapdisk reopen
• Create vif
• Recreate qemu and load status
• VM resume
• Frontend-backend re-connection
• Image r/w open at src (migration fail)
• VM safe roll-back (migration fail)
• VM running at dst (migration success)
• Destroy VM
• Storage lock
• ELB session copy
• Last iteration ELB session copy
• Migration start
• Flush b-cache at src
• ①②③
• SLB session cancel at dst (migration fail)
• Migration step 1, Migration step 2, Migration step N
• Split-brain prevention
• Database update
• Migration status
• Done
• Start
Thank You