Ninja Migration: An Interconnect-transparent Migration for Heterogeneous Data Centers

Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan

High-Performance Grid and Cloud Computing Workshop, May 20, 2013, Boston
Background
• HPC cloud is a promising HPC platform.
• VM migration is useful for improving flexibility and maintainability in cloud computing.
  – Maintenance, fault tolerance, energy-efficient VM placement
  – Disaster recovery
(Figure: VM1–VM3 being migrated across hosts in an Infiniband cluster)
Constraints on VM migration
• Migration with a VMM-bypass I/O device
  – It can greatly reduce the overhead of virtualization, but it is not under the control of the VMM.
(Figure: VM1–VM3 attached directly to Infiniband)
Impact of VMM-bypass I/O
(Figure: execution time [seconds] of the NPB kernels BT, CG, EP, FT, and LU, compared across BMM (IB), BMM (10GbE), KVM (IB), and KVM (virtio))
The overhead of I/O virtualization on the NAS Parallel Benchmarks 3.3.1, class C, 64 processes.
BMM: Bare Metal Machine
KVM (virtio): the guest driver in the guest OS forwards I/O to the physical driver in the VMM, which drives the 10GbE NIC.
KVM (IB): the physical driver runs inside the guest OS and drives the IB QDR HCA directly, bypassing the VMM.
• Performance evaluation of HPC cloud
  – (Para-)virtualized I/O incurs a large overhead.
  – PCI passthrough significantly mitigates the overhead.
Constraints on VM migration
• Migration with a VMM-bypass I/O device
  – It can greatly reduce the overhead of virtualization, but it is not under the control of the VMM.
• Heterogeneity of interconnect devices
  – A VM assigned to an Infiniband device cannot migrate to an Ethernet machine.
(Figure: VM1–VM3 on an Infiniband cluster cannot migrate to an Ethernet cluster)
Challenge
• Goal: migrate a cluster of VMs between heterogeneous data centers, i.e., interconnect-transparent migration.
• Challenge: how do we realize it with minimal virtualization overhead during normal operation?
  – (Para-)virtualized devices suffer from this overhead.
Outline
• Introduction
• Ninja migration: interconnect-transparent migration
• Experiment
• Conclusion
Interconnect-transparent migration
(Figure: VM1–VM3 in normal operation on an Infiniband cluster; a "fallback migration" moves them to an Ethernet cluster for fallback operation, and a "recovery migration" moves them back)
Use cases: transparent fail-over to another cluster for maintenance, evacuation from a disaster-stricken data center, etc.
Requirements
1. Detach VMM-bypass I/O devices only when VM migration is required.
2. Global coordination among distributed VMs before migration.
3. Change the application's transport protocol to the available device after migration.
→ Our approach: leverage the knowledge of the application to ensure cooperation between migration and the communication layer inside the guest OS.
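Requirement 2 above is essentially a barrier: every VM must report that its MPI processes have reached a consistent point (no in-flight messages on the VMM-bypass device) before any device may be detached. The sketch below illustrates that idea in Python; the class and method names are invented for illustration and are not the SymVirt or Open MPI code.

```python
# Illustrative sketch of the global-coordination barrier (hypothetical names;
# the real protocol lives in the Open MPI SELF framework and SymVirt).

class MigrationCoordinator:
    """Collects 'confirm' messages from every VM before allowing migration."""

    def __init__(self, vm_ids):
        self.vm_ids = set(vm_ids)
        self.confirmed = set()

    def confirm(self, vm_id):
        # An MPI process inside vm_id has quiesced its communication.
        self.confirmed.add(vm_id)

    def ready_to_migrate(self):
        # Devices may be detached only once every VM has confirmed.
        return self.confirmed == self.vm_ids


coord = MigrationCoordinator(["vm1", "vm2", "vm3"])
for vm in ["vm1", "vm2", "vm3"]:
    coord.confirm(vm)
assert coord.ready_to_migrate()
```

The point of the barrier is that detaching a VMM-bypass device while a message is in flight would lose state the VMM cannot see, which is exactly why a black-box VMM cannot do this alone.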
SymVirt: Symbiotic Virtualization
Existing VM migration (black-box approach) — pro: portability. The VMM alone performs migration, global coordination, and device setup beneath an unmodified VM and application.
VM migration with SymVirt (gray-box approach) — pro: performance. The application uses VMM-bypass I/O during normal operation; migration, global coordination, and device setup are carried out in cooperation between the guest and the VMM.
Ninja migration
(Figure: a cluster of VMs, each running an MPI application on top of an MPI system, migrated together by Ninja migration)
In conjunction with VM migration, the MPI system is in charge of:
• global coordination among MPI processes
• changing the transport protocol
Implementation
(Figure: the application and MPI runtime run in guest OS mode; the SymVirt coordinator (SELF component) confirms, then control passes to VMM mode, where the SymVirt controller/agent detaches the device, migrates the VM, re-attaches a device, and waits for link-up; a final confirm returns control to the guest.)
• No modification to either the MPI system or applications
• Open MPI user-level checkpoint/restart framework (SELF)
  – Global coordination protocol
  – Re-establishes connections among MPI processes after migration
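The VMM-mode sequence from the implementation figure (confirm → detach → migration → re-attach → link-up → confirm) can be sketched as a short control loop. The `vmm` object below is a stand-in for the SymVirt controller/agent, and its methods are hypothetical, not a real QEMU/KVM API.

```python
# Sketch of the VMM-side sequence; all vmm methods are hypothetical
# stand-ins for the SymVirt controller/agent.

def ninja_migrate(vmm, vm, dest_host, dest_device):
    """Drive one interconnect-transparent migration of a single VM."""
    steps = []
    vmm.detach_device(vm)               # hot-unplug the VMM-bypass device
    steps.append("detach")
    vmm.migrate(vm, dest_host)          # ordinary live migration, device-free
    steps.append("migration")
    vmm.attach_device(vm, dest_device)  # hot-plug whatever the destination has
    steps.append("re-attach")
    vmm.wait_link_up(vm)                # dominant cost when dest. is Infiniband
    steps.append("linkup")
    return steps


class FakeVMM:
    """Recording stub used to exercise the sequence without a real VMM."""
    def __init__(self):
        self.calls = []
    def detach_device(self, vm):      self.calls.append(("detach", vm))
    def migrate(self, vm, host):      self.calls.append(("migrate", vm, host))
    def attach_device(self, vm, dev): self.calls.append(("attach", vm, dev))
    def wait_link_up(self, vm):       self.calls.append(("linkup", vm))
```

Because the device is detached before migration starts, the migrating VM carries no device state, which is what makes an Infiniband-to-Ethernet move possible.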
Outline
• Introduction
• Ninja migration: interconnect-transparent migration
• Experiment
• Conclusion
Experiment
• The overhead of Ninja migration
  – We used 8 VMs on a cluster.
  – We migrated the VMs once during a benchmark execution.
• Fallback and recovery migration between an Infiniband cluster and an Ethernet cluster
  – Infiniband: VMM-bypass I/O (PCI passthrough)
  – Ethernet: para-virtualized I/O (virtio_net)
• Two benchmark programs written in MPI
  – memtest: a simple memory-intensive benchmark
  – NAS Parallel Benchmarks (NPB) version 3.3.1
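The slides describe memtest only as "a simple memory-intensive benchmark". As a hedged illustration of what such a workload looks like, the single-process Python sketch below touches every page of a working-set buffer each step; the real memtest is an MPI program, and this sketch is not its code.

```python
# Plausible shape of a memory-intensive benchmark step (illustrative only;
# the actual memtest used in the paper is an MPI program).
import time

def memtest_step(buf):
    """Touch every 4 KiB page of the working set and return a checksum."""
    page = 4096
    for i in range(0, len(buf), page):
        buf[i] = (buf[i] + 1) & 0xFF  # dirty one byte per page
    return sum(buf[::page])

footprint = 1 << 20            # 1 MiB here; the experiments used 2-16 GB per VM
buf = bytearray(footprint)
start = time.time()
checksum = memtest_step(buf)
elapsed = time.time() - start
```

Dirtying pages at a steady rate is what makes such a workload a stress test for live migration: every touched page must be re-transferred.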
Experimental Setting
We used a 16-node Infiniband cluster.

Blade server: Dell PowerEdge M610
  CPU: Intel quad-core Xeon E5540 2.53 GHz x2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)
  10 GbE: Broadcom NetXtreme II (BMC57711)
Blade switch
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)
  10GbE: Dell M8024
Host machine environment
  OS: Debian 7.0
  Linux kernel: 3.2.18
  QEMU/KVM: 1.1-rc3
  MPI: Open MPI 1.6
  OFED: 1.5.4.1
  Compiler: gcc/gfortran 4.4.6
VM environment
  VCPU: 8
  Memory: 20 GB
Result: memtest
• The overhead of Ninja migration
  – The migration time depends on the memory footprint.
  – Both hotplug and link-up times are almost constant.
(Figure: stacked execution-time breakdown [seconds] by memory footprint; series: migration, hotplug, link-up)

  memory footprint:  2GB    4GB    8GB    16GB
  migration:         35.9   38.7   44.2   53.7
  hotplug:           14.6   13.5   12.5   11.3
  link-up:           28.5   28.5   28.5   28.6
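Reading the stacked-bar values off the memtest figure (assigning the growing series to migration and the ~28.5 s constant series to link-up, consistent with the bullets above), the totals per footprint can be recomputed to separate the footprint-dependent cost from the roughly fixed device cost:

```python
# Per-footprint breakdown [seconds], as read from the memtest figure.
migration = {"2GB": 35.9, "4GB": 38.7, "8GB": 44.2, "16GB": 53.7}
hotplug   = {"2GB": 14.6, "4GB": 13.5, "8GB": 12.5, "16GB": 11.3}
linkup    = {"2GB": 28.5, "4GB": 28.5, "8GB": 28.5, "16GB": 28.6}

# Total overhead grows with footprint ...
totals = {fp: round(migration[fp] + hotplug[fp] + linkup[fp], 1) for fp in migration}
# ... while the device portion (hotplug + link-up) stays roughly fixed.
fixed = {fp: round(hotplug[fp] + linkup[fp], 1) for fp in migration}
```

Only the migration term scales with the memory footprint, so for large VMs the page-transfer time dominates the ~40 s device-switching floor.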
Result: link-up time

  src. device → dest. device   hotplug   link-up   [seconds]
  Infiniband → Infiniband      3.88      29.9
  Ethernet → Infiniband        1.15      29.8
  Ethernet → Ethernet          0.13      0.00
  Infiniband → Ethernet        2.80      0.00

• Focus on the link-up time.
  – Note: the source and the destination are the same node.
• If the destination has an Infiniband device, the link-up time is not a negligible overhead.
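Summing hotplug and link-up per transition makes the asymmetry in the table explicit: the cost of switching devices is dominated by Infiniband link-up at the destination, not by the hotplug itself.

```python
# (hotplug, link-up) times [seconds] per device transition, from the table.
switch_cost = {
    ("IB",  "IB"):  (3.88, 29.9),
    ("Eth", "IB"):  (1.15, 29.8),
    ("Eth", "Eth"): (0.13, 0.00),
    ("IB",  "Eth"): (2.80, 0.00),
}

# Total device-switch time per (source, destination) pair.
total = {path: round(h + l, 2) for path, (h, l) in switch_cost.items()}
```

Every path ending on Infiniband costs over 30 s, while every path ending on Ethernet costs under 3 s, which is why the future-work slide singles out the Infiniband link-up time for investigation.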
Result: NPB (64 proc., Class D)
(Figure: execution time [seconds] of BT, CG, FT, and LU, baseline vs. migration, broken into application, migration, hotplug, and link-up; relative slowdowns: BT +8%, CG +14%, FT +37%, LU +11%)

Transferred Memory Size during VM Migration [MB]:
  BT: 4417   CG: 3394   FT: 15678   LU: 2348

There is no overhead during normal operation. The migration overhead is proportional to the memory footprint.
Fallback/Recovery migration: memtest
(Figure: two panels of per-step execution time [seconds] over steps 1–31, split into application time and overhead. Left panel: total 4 proc. (1 proc./VM); right panel: total 32 proc. (8 proc./VM). Each run moves through the phases labeled 4 hosts (IB), 2 hosts (TCP), 4 hosts (IB), and 4 hosts (TCP).)
Outline
• Introduction
• Ninja migration: interconnect-transparent migration
• Experiment
• Conclusion
Related Work
• Heterogeneous VM migration
  – Vagrant supports live migration across heterogeneous VMMs. [P. Liu '08]
  – Ninja migration provides interconnect-transparent migration.
• VM migration with VMM-bypass I/O devices
  – Driver-level: shadow driver [A. Kadav '09], Nomad [W. Huang '07]
  – Runtime-level: Ninja migration
Conclusion
• We proposed an interconnect-transparent migration mechanism to migrate a cluster of VMs between heterogeneous data centers.
• We demonstrated an implementation called Ninja migration.
  – VMs can migrate between an IB cluster and an Ethernet cluster without restarting the application.
  – It has no performance overhead during normal operation.
Future Work
• Demonstrate the scalability.
• Investigate the very long link-up time of Infiniband.
• Design a runtime-agnostic (MPI-free) implementation.
This work was partly supported by JSPS KAKENHI Grant Number 24700040.