VMworld 2013. Peter Boone, VMware; Seongbeom Kim, VMware. Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
Extreme Performance Series:
Monster Virtual Machines
Peter Boone, VMware
Seongbeom Kim, VMware
VSVC4811
#VSVC4811
Goals
Overview of vSphere CPU/memory management features
Highlight monster VM performance
Explain key features for the monster performance
Recommendations
Agenda
Technical Overview
• vSphere Architecture
• Memory Management
• CPU Scheduler
Monster Performance
Recommendations
Useful Resources
Extreme Performance Series Sessions
VMware vSphere Architecture
[Architecture diagram: guest OSes (TCP/IP stack, file system) run above per-VM monitors (BT or hardware-assisted) providing the virtual NIC, virtual SCSI, and CPU/MMU virtualization; the VMkernel underneath supplies the scheduler, memory allocator, file system, virtual switch, and NIC/IO drivers over the physical hardware]
CPU/memory is managed by the VMkernel and virtualized by the monitor.
Agenda
Technical Overview
• vSphere Architecture
• Memory Management
• Transparent Page Sharing
• Ballooning
• Compression, Swapping
• esxtop Counter Basics - Memory
• CPU Scheduler
Monster Performance
Recommendations
…
Memory – Overview
A VM’s RAM is not necessarily physical RAM
Allocation depends on…
• Host configuration
• Shares
• Limits
• Reservations
• Host load
• Idle/Active VMs
VMware memory reclamation technologies:
• Transparent Page Sharing
• Ballooning
• Compression / Swapping
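The first of these, Transparent Page Sharing, can be sketched in a few lines: collapse byte-identical guest pages onto one machine page, found by content hashing with a full verify on hash match. This is an illustration of the idea only, not ESXi code; the page size, hash choice, and sample pages are assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # standard small page

def share_pages(pages):
    """Collapse byte-identical pages to one copy, TPS-style.

    Returns (unique_pages, mapping) where mapping[i] is the index of
    the machine page backing guest page i.
    """
    seen = {}          # content hash -> index into unique_pages
    unique_pages = []
    mapping = []
    for page in pages:
        h = hashlib.sha1(page).hexdigest()
        if h in seen and unique_pages[seen[h]] == page:  # verify full match
            mapping.append(seen[h])                      # share existing copy
        else:
            seen[h] = len(unique_pages)
            unique_pages.append(page)
            mapping.append(seen[h])
    return unique_pages, mapping

# Three guest pages, two of them identical (e.g. zero pages)
pages = [b"\x00" * PAGE_SIZE, b"\x01" * PAGE_SIZE, b"\x00" * PAGE_SIZE]
unique, mapping = share_pages(pages)
print(len(unique))   # 2 machine pages back 3 guest pages
print(mapping)       # [0, 1, 0]
```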
Memory – Transparent Page Sharing
Memory – Ballooning
Memory – Compression
Memory – Swapping
esxtop Counter Basics - Memory
Interpret the esxtop columns correctly
MEMSZ – Amount of memory (MB) currently configured
GRANT – Amount of memory mapped to a resource pool or virtual machine
SZTGT – Amount of machine memory the ESXi VMkernel wants to allocate to a resource pool or virtual machine
TCHD – Working set (active) estimate for the resource pool or virtual machine over the last few minutes
TCHD_W – Working set writes
SW* – Swap counters (current, target, reads, writes)
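A quick way to read these counters together, sketched in Python. The record below is a toy example with made-up values (real esxtop batch output is a wide CSV), but the relationships between the columns are the ones described above.

```python
# Toy esxtop record for one VM (values in MB; field names follow the
# esxtop columns above, the numbers themselves are illustrative)
vm = {
    "MEMSZ": 8192,   # configured vRAM
    "GRANT": 6144,   # machine memory actually mapped
    "SZTGT": 6400,   # what the VMkernel wants to allocate
    "TCHD": 2048,    # recent working-set estimate
}

# Configured-but-unmapped memory: never touched, or reclaimed
unmapped = vm["MEMSZ"] - vm["GRANT"]

# A low TCHD relative to MEMSZ suggests the VM is oversized
active_ratio = vm["TCHD"] / vm["MEMSZ"]

print(unmapped)                 # 2048
print(round(active_ratio, 2))   # 0.25
```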
Agenda
Technical Overview
• vSphere Architecture
• Memory Management
• CPU Scheduler
• Overhead
• Ready Time
• NUMA, vSMP
• esxtop Counter Basics - CPU
Monster Performance
Recommendations
…
CPU – Overview
Raw processing power of a given host or VM
• Hosts provide CPU resources
• VMs and Resource Pools consume CPU resources
CPU cores/threads need to be shared between VMs
vCPU scheduling challenges
• Fair scheduling
• High responsiveness and throughput
• Virtual interrupts from the guest OS
• vSMP, Co-scheduling
• I/O handling
CPU – Performance Overhead & Utilization
Different workloads have different overhead costs even at the same CPU utilization
CPU virtualization adds varying amounts of system overhead
• Direct execution vs. privileged execution
• Paravirtual adapters vs. emulated adapters
• Virtual hardware (interrupts!)
• Network and storage I/O
CPU – Ready Time
The percentage of time a vCPU is ready to execute but waiting for physical CPU time
Does not necessarily indicate a problem
• Indicates possible CPU contention or limits
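The metric itself is simple arithmetic over the esxtop sample interval, as this small sketch shows. The 5-second interval is esxtop's default; the 10% threshold is a common rule of thumb for investigating contention, not a hard limit.

```python
def ready_pct(ready_ms, interval_ms=5000):
    """esxtop-style %RDY: share of the sample interval a vCPU spent
    runnable but not scheduled (5 s default esxtop interval)."""
    return 100.0 * ready_ms / interval_ms

# A vCPU that waited 500 ms out of a 5 s sample
print(ready_pct(500))   # 10.0, a common rule-of-thumb threshold per vCPU
print(ready_pct(0))     # 0.0, no contention at all
```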
CPU – Contention and Execution Delay
CPU – NUMA nodes
Non-Uniform Memory Access system architecture
• Each node consists of CPU cores and memory
A pCPU can access memory on a remote NUMA node, but at a performance cost
• Access time can be 30% ~ 100% longer
[Diagram: two NUMA nodes, each with CPU cores and local memory]
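The 30%–100% penalty quoted above translates into average latency like this. A quick sketch (the 100 ns local latency is an assumed round number, not a measured figure):

```python
def effective_latency(local_ns, remote_penalty, local_fraction):
    """Average memory latency when a fraction of accesses go to a
    remote NUMA node. remote_penalty is the extra cost of a remote
    access (0.3 to 1.0 per the range above)."""
    remote_ns = local_ns * (1.0 + remote_penalty)
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

# 100 ns local latency, 50% remote penalty, only half the accesses local
print(effective_latency(100.0, 0.5, 0.5))   # 125.0 ns on average
# Perfect locality pays no penalty regardless of the remote cost
print(effective_latency(100.0, 1.0, 1.0))   # 100.0 ns
```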
CPU – vSMP
Relaxed Co-Scheduling: vCPUs can run out-of-sync
Idle vCPUs will waste pCPU resources
• Idle CPUs won’t improve application performance!
• Configure only as many vCPUs as actually needed for each VM
Use Uniprocessor VMs for single-threaded applications
CPU – Scheduling
Over-committing physical CPUs
VMkernel CPU Scheduler
[Animation: as vCPUs over-commit the physical CPUs, some vCPUs must wait to be scheduled]
esxtop Counter Basics - CPU
Interpret the esxtop columns correctly
%RUN – Percentage of time in a running state
%USED – Actual physical CPU usage
%SYS – Kernel time (system services, interrupts…)
%WAIT – Percentage of time in blocked or busy wait states
Agenda
Technical Overview
Monster Performance
• Evolution of Monster VM
• Performance Results
• Key Techniques
Recommendations
Useful Resources
Extreme Performance Series Sessions
Evolution of Monster VM
[Chart: Max vRAM per VM (GB) and Max vCPUs per VM across releases 1.0, 2.0, 3.0, 3.5, 4.0, 4.1, 5.0, 5.x; from 1 vCPU / 2 GB to 64 vCPUs / 1 TB]
64x & 512x increase in VM limits.
Agenda
Technical Overview
Monster Performance
• Evolution of Monster VM
• Performance Results
• In-memory DB
• HPC Workload
• OLAP/OLTP Workload
• Key Techniques
Recommendations
Useful Resources
Extreme Performance Series Sessions
In-memory DB
TATP Benchmark
• Telecommunication Application Transaction Processing
• Simulates the Home Location Register (HLR) application used by mobile carriers
• Requires high throughput for a real-time transactional application
• 1 TB of memory holds 800 million subscribers' data in solidDB
Setup: IBM x3850 X5, 4 sockets, 32 cores, 64 hyper-threads, 1.5 TB RAM; one 64-vCPU, 1 TB VM running solidDB
800 million subscribers: more than the entire US population
In-memory DB (cont’d)
[Chart: throughput (log scale) vs. # DB connections, 1 to 32; higher is better]
Throughput scales linearly, reaching 115K trans./sec.
HPC Workload
SPEC OMP
• Scientific workloads parallelized using OpenMP
• Water modeling, earthquake modeling, crash simulation, etc.
• Up to ~50x speedup compared to a single-threaded run (e.g. 1 hour instead of 2 days)
Setup: HP DL980 G7, 8 sockets, 64 cores, 128 hyper-threads, 512 GB RAM; one 64-vCPU, 128 GB VM running SPEC OMP
HPC Workload: Scalability
[Chart: speedup vs. thread count (4 to 64) for Wupwise, Swim, Apsi, and Art, native (-N) vs. virtual (-V), against ideal scaling; higher is better]
The VM scales as well as bare metal.
HPC Workload: Comparison to Bare-metal
[Chart: speedup relative to bare metal for Equake, Wupwise, Swim, Applu, Apsi, Fma3d, Art, and their average; higher is better]
95% of bare-metal performance.
OLAP/OLTP*
Standard Mixed Database Workload
• Mixed OLAP & OLTP, DB size = 150 GB
Enhanced Mixed Load
• Mixed OLAP query execution & data loading, DB size = 400 million records
Setup: Dell R910, 4 sockets, 40 cores, 80 hyper-threads, 1 TB RAM; one 40-vCPU, 512 GB VM running HANA on SLES 11 SP2
OLAP/OLTP* (cont’d)
(*) VAPP5591: Big Data: Virtualized SAP HANA Performance, Scalability and Best Practices
[Chart: speedup relative to bare metal for SML, EML throughput, and EML response time; higher is better]
95% of bare-metal performance.
Agenda
Technical Overview
Monster Performance
• Evolution of Monster VM
• Performance Results
• Key Techniques
• Scalable Synchronization
• vNUMA
• IO Contexts
Recommendations
Useful Resources
Extreme Performance Series Sessions
Scalable Synchronization
Problem
• High number of vCPUs means heavy synchronization
• Memory virtualization
• Memory allocation, page sharing, NUMA remapping, etc.
• Co-scheduling
• Critical to monster VM’s performance
Solutions
• Hashed lock instead of global lock
• Scalable lock primitive instead of a spin lock
• Feature built into vmkernel
Results
• Boot time of a huge VM is reduced by an order of magnitude
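The "hashed lock instead of global lock" idea above is classic lock striping: spread contention across N locks keyed by a hash, so only operations on the same stripe serialize. A minimal sketch in Python using threads (an illustration of the technique, not vmkernel code; the stripe count and page-touch workload are assumptions):

```python
import threading

class StripedLock:
    """Hashed (striped) locks: contention spreads across many locks
    instead of serializing on one global lock."""
    def __init__(self, stripes=64):
        self._locks = [threading.Lock() for _ in range(stripes)]

    def lock_for(self, key):
        # Same key always maps to the same stripe
        return self._locks[hash(key) % len(self._locks)]

shared_counts = {}
striped = StripedLock()

def touch_page(page_id):
    # Only operations hashing to the same stripe contend with each other
    with striped.lock_for(page_id):
        shared_counts[page_id] = shared_counts.get(page_id, 0) + 1

# 64 concurrent "vCPUs" touching 8 distinct pages
threads = [threading.Thread(target=touch_page, args=(i % 8,)) for i in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sum(shared_counts.values()))   # 64 updates, none lost
```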
Scalable Synchronization (cont'd)
[Diagrams: with a global lock, heavy lock contention hamstrings the monster VM; reducing lock contention improves scalability]
vNUMA
Problem
• Monster VM has more vCPUs than #cores per NUMA node
• Guest application/OS has no knowledge of underlying NUMA
• Suboptimal process and memory placement leads to performance loss
Solutions
• Expose virtual NUMA to VM
• Guest application/OS achieves optimal placement of processes and memory
• Improved memory locality yields a performance gain
• Automatically enabled for a wide VM
Results
• Up to 70% performance improvement
[Diagrams: four NUMA nodes of six cores each. A VM that fits within one node has good memory locality; a wide VM without vNUMA places processes and memory blindly, giving poor locality; with vNUMA exposed, placement is optimal and locality is good again]
vNUMA (cont'd)
Enabled when…
• vHW version 8 or higher
• #vCPUs / VM > #cores / NUMA node
• Hyper-Threads don't count
• VM is configured with 9 or more vCPUs
• Can be lowered via numa.vcpu.min
Determined when…
• The VM first powers on after creation
• Persistent across vMotion, suspend/resume, snapshot
• Best to keep your cluster homogeneous
• Consider powering on the VM on a host with smaller NUMA nodes
vNUMA != pNUMA
• The vNUMA topology is determined by vCPUs per VM and #cores per NUMA node, so it may not match the physical topology
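The enablement conditions above compose into a simple predicate. A sketch of the default decision (per the rules listed; advanced settings beyond numa.vcpu.min can change the real behavior):

```python
def vnuma_exposed(vcpus, cores_per_pnode, vhw_version, numa_vcpu_min=9):
    """Default conditions under which vSphere exposes vNUMA to a VM:
    vHW 8+, at least numa.vcpu.min vCPUs (default 9), and a VM wider
    than one physical NUMA node (hyper-threads don't count)."""
    return (vhw_version >= 8
            and vcpus >= numa_vcpu_min
            and vcpus > cores_per_pnode)

print(vnuma_exposed(16, 8, 9))   # True: wide VM on vHW 9
print(vnuma_exposed(8, 8, 9))    # False: fits one node, and below 9 vCPUs
print(vnuma_exposed(16, 8, 7))   # False: vHW version too old
```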
IO Contexts
Problem
• How to achieve extremely high IO throughput?
Solutions
• Offload IO processing to separate IO contexts
• Exploits idle cores
• Parallelism at every layer of IO processing
• Affinity aware scheduling
• Schedule contexts with heavy communication together
• Benefit from cache sharing
• Feature built into vmkernel
Results
• 1 million IOPS
• Internal testing reached 26 Gbps with a single VM and multiple vNICs
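The offload idea can be sketched with a producer and a separate worker: the "vCPU" keeps issuing IOs while a dedicated IO context drains them, instead of processing each one inline. This is an illustration of the pattern using Python threads, not the vmkernel design.

```python
import queue
import threading

io_queue = queue.Queue()
completed = []

def io_context():
    """Separate IO context: drains requests so the vCPU never
    processes completions inline."""
    while True:
        req = io_queue.get()
        if req is None:            # shutdown sentinel
            break
        completed.append(req)      # stand-in for real IO processing
        io_queue.task_done()

worker = threading.Thread(target=io_context)
worker.start()

# The "vCPU" issues IOs back-to-back without waiting on each one
for i in range(10):
    io_queue.put(("write", i))

io_queue.join()                    # wait for the IO context to drain
io_queue.put(None)
worker.join()
print(len(completed))              # 10
```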
IO Contexts: Offload IO Processing to Separate IO Contexts
[Diagrams: two cores (C0, C1) with private L1 and shared L2 caches. Inline IO virtualization on C0 delays the next IO issue, serializing VM execution and IO processing; offloading to an IO context on C1 stops delaying the VM, issues IOs faster, and raises throughput by exploiting idle cores and parallelism]
IO Contexts: Affinity Aware Scheduling
[Diagrams: contexts communicating over a shared cache are efficient; communication through a remote cache is expensive. vSphere collocates communicating contexts (affinity-aware scheduling)]
Agenda
Technical Overview
Monster Performance
Recommendations
• Hardware Features
• Avoid Pitfalls
Useful Resources
Extreme Performance Series Sessions
Hardware Features
Hardware support for virtualization
• More efficient CPU/MMU virtualization
• Significant performance benefit
NUMA (Non Uniform Memory Access)
• On-chip memory controller
• Higher memory bandwidth to feed multi-cores
• Scheduling has a significant performance impact
• ESXi is optimized for NUMA
Hyper-Threading
• Two logical processors per physical processor
• Total throughput is noticeably higher
• Has little effect on single threaded performance
Hardware Features (cont’d)
More cores vs. Higher clock frequency
• Single threaded application won’t benefit from multi-cores
• Latency sensitive workload benefits from faster CPU
• CPU with higher frequency often comes with bigger cache
Bigger DRAM vs. faster DRAM
• Cache (10s cycles) << DRAM (10s ns) << Flash (10s usec) << Disk (1 msec)
• Faster DRAM makes sense if capacity is not a concern
• Bigger DRAM makes sense to avoid going to disks
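The hierarchy above explains the trade-off numerically: with disk roughly four orders of magnitude slower than DRAM, raising the hit ratio (bigger DRAM) beats shaving DRAM latency (faster DRAM). A sketch with rounded figures from the hierarchy (the exact latencies are assumptions):

```python
def avg_access_ns(dram_hit, dram_ns=50.0, disk_ns=1_000_000.0):
    """Average access time when DRAM misses go to disk (~1 ms).
    Shows why capacity (hit ratio) dominates DRAM speed."""
    return dram_hit * dram_ns + (1.0 - dram_hit) * disk_ns

# Faster DRAM at a 99% hit ratio vs. bigger-but-slower DRAM at 99.9%
print(avg_access_ns(0.99, dram_ns=40.0))    # ~10040 ns: misses dominate
print(avg_access_ns(0.999, dram_ns=50.0))   # ~1050 ns: ~10x better overall
```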
Hardware Features (cont’d)
Power management
• Allow vSphere to control power management policy
• “OS Control” mode in BIOS
• Default vSphere policy is to save power with minimal performance impact
• Enable C-states for more power saving
• Turn off processor components at halt
Turbo boost and C-state
• Up to 10% performance benefit with Turbo-boost
• Enabling C-state makes turbo-boost more effective
• Both should be enabled
SSD
• Under memory overcommit, enable Swap-to-SSD (llSwap)
• ~10% performance improvement
Avoid Pitfalls
Avoid high active host memory over-commitment
• No host swapping occurs when total memory demand is less than the physical memory (assuming no limits)
Right-size guest VMs
• vCPU
• vRAM
• Guest-level paging indicates the vRAM is too small
Use a fully automated DRS cluster
Avoid Pitfalls (cont’d)
vSocket != pSocket
• Use the default unless you have a good reason
• 1 core per virtual socket
• N virtual sockets for N-vCPU VM
• vSocket dictates vNUMA
• Understand the implication on vNUMA
• Be careful with CPUs that have two NUMA nodes per socket
Agenda
Technical Overview
Monster Performance
Recommendations
Useful Resources
Extreme Performance Series Sessions
Performance Community Resources
Performance Technology Pages
• http://www.vmware.com/technical-resources/performance/resources.html
Technical Marketing Blog
• http://blogs.vmware.com/vsphere/performance/
Performance Engineering Blog VROOM!
• http://blogs.vmware.com/performance
Performance Community Forum
• http://communities.vmware.com/community/vmtn/general/performance
Virtualizing Business Critical Applications
• http://www.vmware.com/solutions/business-critical-apps/
Performance Technical Resources
Performance Technical Papers
• http://www.vmware.com/resources/techresources/cat/91,96
Performance Best Practices
• http://www.youtube.com/watch?v=tHL6Vu3HoSA
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf
• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf
Troubleshooting Performance Related Problems in vSphere
Environments
• http://communities.vmware.com/docs/DOC-14905 (vSphere 4.1)
• http://communities.vmware.com/docs/DOC-19166 (vSphere 5)
• http://communities.vmware.com/docs/DOC-23094 (vSphere 5.x with vCOps)
Performance Technical Resources (cont'd)
Resource Management Guide
• https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html
The CPU Scheduler in VMware vSphere 5.1
• http://www.vmware.com/resources/techresources/10345
Understanding Memory Management in VMware vSphere 5
• http://www.vmware.com/resources/techresources/10206
Host Power Management in VMware vSphere 5.5
• http://www.vmware.com/files/pdf/techpaper/hpm-perf-vsphere55.pdf
Don’t miss:
vCenter of the Universe – Session # VSVC5234
Monster Virtual Machines – Session # VSVC4811
Network Speed Ahead – Session # VSVC5596
Storage in a Flash – Session # VSVC5603
Big Data: Virtualized SAP HANA Performance, Scalability and Best Practices – Session # VAPP5591
Other VMware Activities Related to This Session
HOL:
HOL-SDC-1304
vSphere Performance Optimization
HOL-SDC-1317
vCloud Suite Use Cases - Business Critical Applications
Group Discussions:
VSVC1001-GD
Performance with Mark Achtemichuk
THANK YOU