VMworld 2013: Extreme Performance Series: Monster Virtual Machines

Page 1: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

Extreme Performance Series:

Monster Virtual Machines

Peter Boone, VMware

Seongbeom Kim, VMware

VSVC4811

#VSVC4811

Page 2: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

2

Goals

Overview of vSphere CPU/memory management features

Highlight monster VM performance

Explain the key features behind monster VM performance

Recommendations

Page 3: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

3

Agenda

Technical Overview

• vSphere Architecture

• Memory Management

• CPU Scheduler

Monster Performance

Recommendations

Useful Resources

Extreme Performance Series Sessions

Page 4: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

4

VMware vSphere Architecture

[Architecture diagram: guest VMs (each with its own TCP/IP stack and file system) run on the monitor (BT, HW); the VMkernel underneath provides the memory allocator, CPU scheduler, virtual switch, NIC drivers, I/O drivers, file system, and the virtual NIC / virtual SCSI devices, all on top of the physical hardware.]

CPU and memory are managed by the VMkernel and virtualized by the monitor.

Page 5: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

5

Agenda

Technical Overview

• vSphere Architecture

• Memory Management

• Transparent Page Sharing

• Ballooning

• Compression, Swapping

• esxtop Counter Basics - Memory

• CPU Scheduler

Monster Performance

Recommendations

Page 6: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

6

Memory – Overview

A VM’s RAM is not necessarily physical RAM

Allocation depends on… (see the sketch below)

• Host configuration

• Shares

• Limits

• Reservations

• Host load

• Idle/Active VMs

VMware memory reclamation technologies:

• Transparent Page Sharing

• Ballooning

• Compression / Swapping
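To make the interplay of shares, reservations, and limits concrete, here is a minimal sketch of a shares-proportional allocation under contention. It is not the actual ESXi algorithm (which also factors in an idle-memory tax and working-set estimates); the VM names and numbers are hypothetical.

```python
# Simplified, illustrative sketch of shares-based allocation under contention.
# NOT the actual ESXi algorithm; it only shows how shares, reservations, and
# limits interact. VM names and sizes are hypothetical.

def allocate(host_mb, vms):
    # Everyone starts at their reservation (the guaranteed minimum).
    alloc = {vm['name']: vm['reservation'] for vm in vms}
    remaining = host_mb - sum(alloc.values())
    total_shares = sum(vm['shares'] for vm in vms)
    for vm in vms:
        # Distribute the rest proportionally to shares, capped by the
        # VM's limit and its configured size.
        extra = remaining * vm['shares'] / total_shares
        cap = min(vm['limit'], vm['configured'])
        alloc[vm['name']] = min(vm['reservation'] + extra, cap)
    return alloc

print(allocate(16384, [
    {'name': 'db',  'shares': 2000, 'reservation': 4096, 'limit': 12288, 'configured': 12288},
    {'name': 'web', 'shares': 1000, 'reservation': 0,    'limit': 8192,  'configured': 8192},
]))
```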

Page 7: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

7

Memory – Transparent Page Sharing

Page 8: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

8

Memory – Ballooning

Page 9: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

9

Memory – Compression

Page 10: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

10

Memory – Swapping

Page 11: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

11

Memory – Swapping

Page 12: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

12

esxtop Counter Basics - Memory

Page 13: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

13

esxtop Counter Basics - Memory

Interpret the esxtop columns correctly (see the sample check below)

MEMSZ – Amount of memory (MB) currently configured

GRANT – Amount of machine memory mapped to a resource pool or virtual machine

SZTGT – Amount of machine memory the ESXi VMkernel wants to allocate to a resource pool or virtual machine

TCHD – Working set (active) estimate for the resource pool or virtual machine over the last few minutes

TCHD_W – Write working set estimate

SW* – Swap counters (current, target, reads, writes)
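As a quick illustration of how these counters are read together, the hedged sketch below applies a few common-sense checks to a hypothetical sample; the values and thresholds are invented for the example, and SWCUR/SWR/s stand in for the SW* family above.

```python
# Hedged sketch: sanity checks over a hypothetical esxtop memory sample for
# one VM. All values and thresholds are invented for the example.
sample = {
    'MEMSZ': 65536,   # configured vRAM (MB)
    'GRANT': 61440,   # machine memory currently mapped (MB)
    'SZTGT': 49152,   # what the VMkernel wants to allocate (MB)
    'TCHD':  20480,   # estimated active working set (MB)
    'SWCUR': 1024,    # currently swapped (MB)
    'SWR/s': 12.0,    # swap-in rate
}

if sample['SZTGT'] < sample['GRANT']:
    print('Target below grant: the VMkernel intends to reclaim memory.')
if sample['SWCUR'] > 0 or sample['SWR/s'] > 0:
    print('Host-level swapping has occurred; expect latency when those pages are touched.')
if sample['TCHD'] < 0.25 * sample['MEMSZ']:
    print('Active working set is far below the configured size; the VM may be oversized.')
```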

Page 14: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

14

Agenda

Technical Overview

• vSphere Architecture

• Memory Management

• CPU Scheduler

• Overhead

• Ready Time

• NUMA, vSMP

• esxtop Counter Basics - CPU

Monster Performance

Recommendations

Page 15: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

15

CPU – Overview

Raw processing power of a given host or VM

• Hosts provide CPU resources

• VMs and Resource Pools consume CPU resources

CPU cores/threads need to be shared between VMs

vCPU scheduling challenges

• Fair scheduling

• High responsiveness and throughput

• Virtual interrupts from the guest OS

• vSMP, Co-scheduling

• I/O handling

Page 16: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

16

CPU – Performance Overhead & Utilization

Different workloads have different overhead costs, even at the same CPU utilization

CPU virtualization adds varying amounts of system overhead

• Direct execution vs. privileged execution

• Paravirtual adapters vs. emulated adapters

• Virtual hardware (Interrupts!)

• Network and storage I/O

Page 17: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

17

CPU – Ready Time

The percentage of time that a vCPU is ready to execute but is waiting for physical CPU time (see the conversion example below)

Does not necessarily indicate a problem

• Indicates possible CPU contention or limits
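A worked conversion helps here: vCenter reports ready time as summed milliseconds per sample, while esxtop reports %RDY. Assuming the usual 20-second real-time sample interval, the sketch below converts one into a per-vCPU percentage.

```python
# Convert vCenter's "CPU Ready" summation (milliseconds accumulated per
# sample) into a per-vCPU percentage comparable to esxtop %RDY.
# Assumes the real-time chart's 20,000 ms sample interval.

def ready_percent(ready_ms, num_vcpus, interval_ms=20000):
    return ready_ms / (interval_ms * num_vcpus) * 100

# Example: 4,000 ms of accumulated ready time on an 8-vCPU VM over one
# 20 s sample is 2.5% ready per vCPU.
print(round(ready_percent(4000, 8), 2))
```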

Page 18: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

18

CPU – Contention and Execution Delay

Page 19: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

19

CPU – NUMA nodes

Non-Uniform Memory Access system architecture

• Each node consists of CPU cores and memory

Memory accesses can cross NUMA nodes, but at a performance cost

• Remote access time can be 30% ~ 100% longer (see the sketch below)

[Diagram: NUMA node 1 and NUMA node 2.]
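A small back-of-the-envelope calculation shows why locality matters: with the 30%-100% remote penalty above, average memory latency climbs quickly as the share of remote accesses grows. The 100 ns local latency and 50% penalty below are illustrative assumptions.

```python
# Illustrative only: average memory latency as locality drops, using the
# 30%-100% remote-access penalty cited above. The 100 ns local latency and
# the 50% penalty are assumptions made for the example.

def avg_latency_ns(local_fraction, local_ns=100.0, remote_penalty=0.5):
    remote_ns = local_ns * (1 + remote_penalty)
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

for locality in (1.0, 0.75, 0.5):
    print(f'{locality:.0%} local accesses -> {avg_latency_ns(locality):.1f} ns average')
```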

Page 20: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

20

CPU – NUMA nodes

Page 21: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

21

CPU – vSMP

Relaxed Co-Scheduling: vCPUs can run out-of-sync

Idle vCPUs will waste pCPU resources

• Idle CPUs won’t improve application performance!

• Configure only as many vCPUs as actually needed for each VM

Use Uniprocessor VMs for single-threaded applications

Page 22: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

22

CPU – Scheduling

Overcommitting physical CPUs

VMkernel CPU Scheduler

Page 23: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

23

CPU – Scheduling

Overcommitting physical CPUs

VMkernel CPU Scheduler


Page 24: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

24

CPU – Scheduling

Overcommitting physical CPUs

VMkernel CPU Scheduler


Page 25: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

25

esxtop Counter Basics - CPU

Page 26: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

26

esxtop Counter Basics - CPU

Interpret the esxtop columns correctly (see the sample check below)

%RUN – Percentage of time in a running state

%USED – Actual physical CPU usage

%SYS – Kernel time (system services, interrupts…)

%WAIT – Percentage of time in blocked or busy wait states
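As a sample check, the sketch below applies rough heuristics to a hypothetical esxtop CPU sample for an 8-vCPU VM. %RDY comes from the Ready Time discussion earlier; the thresholds are illustrative rather than official guidance.

```python
# Hedged sketch: rough heuristics over a hypothetical esxtop CPU sample for
# an 8-vCPU VM. Thresholds are illustrative, not official guidance.
sample = {'%RUN': 310.0, '%USED': 295.0, '%SYS': 14.0, '%WAIT': 450.0, '%RDY': 52.0}
num_vcpus = 8

rdy_per_vcpu = sample['%RDY'] / num_vcpus
if rdy_per_vcpu > 5.0:
    print(f'{rdy_per_vcpu:.1f}% ready per vCPU: possible CPU contention or a limit.')
if sample['%SYS'] > 20.0:
    print('High %SYS: heavy interrupt/I/O processing done by the kernel on behalf of the VM.')
if sample['%USED'] < 0.9 * sample['%RUN']:
    print('%USED well below %RUN: frequency scaling or hyper-thread sharing may be at play.')
```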

Page 27: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

27

Agenda

Technical Overview

Monster Performance

• Evolution of Monster VM

• Performance Results

• Key Techniques

Recommendations

Useful Resources

Extreme Performance Series Sessions

Page 28: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

28

Evolution of Monster VM

[Chart: maximum vCPUs per VM and maximum vRAM per VM (GB) across releases 1.0, 2.0, 3.0, 3.5, 4.0, 4.1, 5.0, and 5.x, growing from 1 vCPU and 2 GB to 64 vCPUs and 1 TB.]

64x and 512x increase in VM limits.

Page 29: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

29

Agenda

Technical Overview

Monster Performance

• Evolution of Monster VM

• Performance Results

• In-memory DB

• HPC Workload

• OLAP/OLTP Workload

• Key Techniques

Recommendations

Useful Resources

Extreme Performance Series Sessions

Page 30: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

30

In-memory DB

TATP Benchmark

• Telecommunication Application Transaction Processing

• Simulates the Home Location Register (HLR) application used by mobile carriers

• Requires high throughput for a real-time transactional application

• 1 TB of memory holds 800 million subscribers' data in solidDB

Test setup: IBM x3850 X5 (4 sockets, 32 cores, 64 hyper-threads, 1.5 TB RAM) running a 64-vCPU, 1 TB VM with solidDB

Page 31: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

31

In-memory DB

[Diagram: the same IBM x3850 X5 (4 sockets, 32 cores, 64 hyper-threads, 1.5 TB) hosting the 64-vCPU, 1 TB solidDB VM; the 800 million subscriber dataset exceeds the entire US population.]

Page 32: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

32

In-memory DB (cont’d)

[Chart: TATP throughput vs. number of DB connections (1 to 32), reaching 115K transactions/sec.]

Throughput scales linearly.

Page 33: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

33

HPC Workload

SPEC OMP

• Scientific workloads parallelized using OpenMP

• Water modeling, earthquake modeling, crash simulation, etc.

• Up to ~50x speedup compared to a single-threaded run (e.g. 1 hour instead of 2 days)

Test setup: HP DL980 G7 (8 sockets, 64 cores, 128 hyper-threads, 512 GB RAM) running a 64-vCPU, 128 GB VM with SPEC OMP

Page 34: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

34

HPC Workload: Scalability

[Chart: speedup vs. thread count (4 to 64) for Wupwise, Swim, Apsi, and Art, native (-N) vs. virtual (-V), plotted against ideal scaling.]

VM scales as well as bare metal.

Page 35: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

35

HPC Workload: Comparison to Bare-metal

[Chart: speedup of the VM relative to bare metal for Equake, Wupwise, Swim, Applu, Apsi, Fma3d, Art, and the average.]

~95% of bare-metal performance

Page 36: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

36

OLAP/OLTP*

Standard Mixed Database Workload

• Mixed OLAP & OLTP, DB size = 150 GB

Enhanced Mixed Load

• Mixed OLAP query execution & data loading, DB size = 400 million records

Test setup: Dell R910 (4 sockets, 40 cores, 80 hyper-threads, 1 TB RAM) running a 40-vCPU, 512 GB VM with HANA on SLES 11 SP2

Page 37: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

37

OLAP/OLTP* (cont’d)

(*) VAPP5591: Big Data: Virtualized SAP HANA Performance, Scalability and Best Practices

[Chart: speedup of the VM relative to bare metal for SML, EML throughput, and EML response time.]

~95% of bare-metal performance

Page 38: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

38

Agenda

Technical Overview

Monster Performance

• Evolution of Monster VM

• Performance Results

• Key Techniques

• Scalable Synchronization

• vNUMA

• IO Contexts

Recommendations

Useful Resources

Extreme Performance Series Sessions

Page 39: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

39

Scalable Synchronization

Problem

• A high number of vCPUs means heavy synchronization

• Memory virtualization: memory allocation, page sharing, NUMA remapping, etc.

• Co-scheduling

• Critical to a monster VM's performance

Solutions

• Hashed locks instead of a global lock (see the sketch below)

• Scalable lock primitives instead of a spin lock

• Built into the VMkernel

Results

• Boot time of a huge VM is reduced by an order of magnitude
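The hashed-lock idea can be sketched in a few lines: instead of one global lock protecting a large shared structure, a small array of locks is selected by hash, so unrelated operations rarely contend. This is a minimal Python illustration of the technique named above, not the VMkernel implementation.

```python
# Minimal sketch of the "hashed lock" idea: protect a large shared table with
# an array of locks selected by hash instead of one global lock, so unrelated
# pages/vCPUs rarely contend. A Python illustration, not VMkernel code.
import threading

NUM_LOCKS = 64
locks = [threading.Lock() for _ in range(NUM_LOCKS)]
page_table = {}

def lock_for(key):
    # Operations only serialize with others that hash to the same bucket.
    return locks[hash(key) % NUM_LOCKS]

def update_page(page_number, value):
    with lock_for(page_number):
        page_table[page_number] = value

update_page(0x1A2B3, 'mapped')
print(page_table)
```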

Page 40: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

40

Scalable Synchronization (cont’d)

[Diagram: guest operations A and B across several vCPUs; the corresponding host-level operations (A*) contend on a single lock.]

Heavy lock contention hamstrings a monster VM.

Page 41: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

41

Scalable Synchronization (cont’d)

[Diagram: the same guest operations; with finer-grained (hashed) locks the host-level operations no longer contend.]

Reducing lock contention improves scalability.

Page 42: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

42

vNUMA

Problem

• A monster VM has more vCPUs than there are cores per NUMA node

• The guest application/OS has no knowledge of the underlying NUMA topology

• Suboptimal process and memory placement → performance loss

Solutions

• Expose virtual NUMA to the VM (see the sketch below)

• The guest application/OS can then achieve optimal placement of processes and memory

• Improved memory locality → performance gain

• Automatically enabled for a wide VM

Results

• Up to 70% performance improvement
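The sketch below shows, in simplified form, how a virtual NUMA layout can be derived from the vCPU count and the physical cores per NUMA node, which is the topology a guest OS or application would then use for placement. ESXi's actual sizing also honors cores-per-socket and advanced options, so treat this as an approximation.

```python
# Simplified sketch: the virtual NUMA layout a guest might be presented,
# derived from the vCPU count and the physical cores per NUMA node.
# ESXi's real sizing also honors cores-per-socket and advanced options.
import math

def vnuma_layout(num_vcpus, cores_per_pnode):
    nodes = max(1, math.ceil(num_vcpus / cores_per_pnode))
    per_node = math.ceil(num_vcpus / nodes)
    return [list(range(n * per_node, min((n + 1) * per_node, num_vcpus)))
            for n in range(nodes)]

# A 64-vCPU VM on a host with 16 cores per NUMA node -> 4 virtual nodes.
for node_id, vcpus in enumerate(vnuma_layout(64, 16)):
    print(f'vNUMA node {node_id}: vCPUs {vcpus[0]}-{vcpus[-1]}')
```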

Page 43: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

43

vNUMA (cont’d)

[Diagram: four NUMA nodes of six cores each (C0-C5).]

Good memory locality.

Page 44: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

44

vNUMA (cont’d)

[Diagram: four NUMA nodes of six cores each (C0-C5).]

Poor memory locality without vNUMA.

Page 45: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

45

vNUMA (cont’d)

[Diagram: four NUMA nodes of six cores each (C0-C5).]

Good memory locality with vNUMA.

Page 46: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

46

vNUMA (cont'd)

Enabled when…

• vHW version 8 or higher

• #vCPUs / VM > #cores / NUMA node

• Hyper-threads don't count

• VM is configured with 9 or more vCPUs

• Can be lowered via numa.vcpu.min (see the sketch below)

Determined when…

• At the VM's first power-on after creation

• Persistent across vMotion, suspend/resume, and snapshots

• Best to keep your cluster homogeneous

• Consider powering on the VM on a host with smaller NUMA nodes

vNUMA != pNUMA

• vNUMA is sized from the vCPUs per VM and the #cores per NUMA node at first power-on

• It may not match the pNUMA topology of the host the VM currently runs on
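The enablement rules listed above can be paraphrased as a small predicate; this is a sketch of the slide's conditions (vHW 8+, more vCPUs than cores per physical NUMA node, and at least numa.vcpu.min vCPUs, 9 by default), not an exact reproduction of ESXi's logic.

```python
# Sketch of the enablement rules above (paraphrased from the slide): virtual
# hardware 8+, more vCPUs than cores in a physical NUMA node, and at least
# numa.vcpu.min vCPUs (default 9). Not an exact reproduction of ESXi's logic.

def vnuma_exposed(vhw_version, num_vcpus, cores_per_pnode, numa_vcpu_min=9):
    if vhw_version < 8:
        return False
    return num_vcpus > cores_per_pnode and num_vcpus >= numa_vcpu_min

print(vnuma_exposed(vhw_version=9, num_vcpus=16, cores_per_pnode=8))   # True
print(vnuma_exposed(vhw_version=9, num_vcpus=8,  cores_per_pnode=6))   # False (below numa.vcpu.min)
```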

Page 47: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

47

IO Contexts

Problem

• How to achieve extremely high IO throughput?

Solutions

• Offload IO processing to separate IO contexts (see the sketch below)

• Exploits idle cores

• Parallelism at every layer of IO processing

• Affinity-aware scheduling

• Schedule contexts that communicate heavily together

• Benefit from cache sharing

• Built into the VMkernel

Results

• 1 million IOPS

• Internal testing reached 26 Gbps with a single VM and multiple vNICs
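As a conceptual analogy for offloading IO processing to separate IO contexts, the sketch below hands I/O work from the submitting thread to a dedicated worker through a queue, so the submitter is not serialized with I/O processing. It is an illustration of the idea only, not VMkernel code.

```python
# Conceptual analogy for "offload IO processing to separate IO contexts":
# the submitting ("VM") thread hands work to a dedicated I/O worker through a
# queue instead of processing it inline. An illustration only, not VMkernel code.
import queue
import threading

io_queue = queue.Queue()

def io_context():
    # Dedicated context: drains and processes I/O requests, ideally on an
    # otherwise idle core, in parallel with the submitter.
    while True:
        req = io_queue.get()
        if req is None:
            break
        print(f'processed {req}')

worker = threading.Thread(target=io_context)
worker.start()

# The submitter queues work and immediately continues, instead of
# serializing its execution with I/O processing.
for i in range(3):
    io_queue.put(f'io-request-{i}')

io_queue.put(None)   # sentinel: let the worker drain and exit
worker.join()
```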

Page 48: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

48

IO Contexts: Offload IO Processing to Separate IO Contexts

[Diagram: two cores C0 and C1 with private L1 caches and a shared L2; I/O virtualization work runs inline on the VM's core, delaying the next I/O issue.]

Serializing VM execution and IO processing.

Page 49: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

49

IO Contexts: Offload IO Processing to Separate IO Contexts

[Diagram: the same two cores; I/O processing is offloaded to a separate context on the other core, so the VM is not delayed, I/O issues faster, and throughput is higher.]

Exploiting idle cores and parallelism.

Page 50: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

50

IO Contexts: Affinity-Aware Scheduling

[Diagram: communication between contexts on a shared cache is efficient; communication across remote caches is expensive.]

vSphere collocates communicating contexts (affinity-aware scheduling).

Page 51: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

51

Agenda

Technical Overview

Monster Performance

Recommendations

• Hardware Features

• Avoid Pitfalls

Useful Resources

Extreme Performance Series Sessions

Page 52: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

52

Hardware Features

Hardware support for virtualization

• More efficient CPU/MMU virtualization

• Significant performance benefit

NUMA (Non Uniform Memory Access)

• On-chip memory controller

• Higher memory bandwidth to feed multi-cores

• Scheduling has a significant performance impact

• ESXi is optimized for NUMA

Hyper-Threading

• Two logical processors per physical processor

• Total throughput is noticeably higher

• Has little effect on single-threaded performance

Page 53: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

53

Hardware Features (cont’d)

More cores vs. Higher clock frequency

• A single-threaded application won't benefit from multiple cores

• Latency-sensitive workloads benefit from a faster CPU

• CPUs with higher frequencies often come with bigger caches

Bigger DRAM vs. faster DRAM

• Cache (10s of cycles) << DRAM (10s of ns) << Flash (10s of usec) << Disk (~1 ms)

• Faster DRAM makes sense if capacity is not a concern

• Bigger DRAM makes sense to avoid going to disk (see the sketch below)
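The capacity-versus-speed trade-off is easy to quantify: even a small fraction of accesses that fall through to disk dominates the average access time, so extra DRAM usually beats faster DRAM until the working set fits. The latencies below are illustrative, roughly matching the ladder above.

```python
# Illustrative arithmetic for "bigger DRAM vs. faster DRAM": a small fraction
# of accesses spilling to disk dominates average latency, so capacity usually
# wins until the working set fits. Latencies approximate the ladder above.
DRAM_NS = 100          # ~10s of ns
DISK_NS = 1_000_000    # ~1 ms

def avg_access_ns(fit_fraction, dram_ns=DRAM_NS):
    return fit_fraction * dram_ns + (1 - fit_fraction) * DISK_NS

print(f'100% in DRAM:               {avg_access_ns(1.00):>10,.0f} ns')
print(f'99% in DRAM, 1% on disk:    {avg_access_ns(0.99):>10,.0f} ns')
print(f'20% faster DRAM, 1% disk:   {avg_access_ns(0.99, dram_ns=80):>10,.0f} ns')
```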

Page 54: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

54

Hardware Features (cont’d)

Power management

• Allow vSphere to control power management policy

• “OS Control” mode in BIOS

• Default vSphere policy is to save power with minimal performance impact

• Enable C-states for more power saving

• Turn off processor components at halt

Turbo Boost and C-states

• Up to 10% performance benefit with Turbo Boost

• Enabling C-states makes Turbo Boost more effective

• Both should be enabled

SSD

• Under memory overcommit, enable Swap-to-SSD (llSwap)

• ~10% performance improvement

Page 55: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

55

Avoid Pitfalls

Avoid high active host memory over-commitment

• No host swapping occurs when total memory demand is less than physical memory, assuming no limits are set (see the check below)

Right-size guest VMs

• vCPU

• vRAM

• Guest-level paging indicates the vRAM is too small

Use a fully automated DRS cluster
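The overcommit rule above can be expressed as a tiny check: with no limits set, host swapping is not expected while total active demand stays below physical memory. The VM names and sizes below are hypothetical.

```python
# Tiny check of the rule above: with no limits set, host swapping is not
# expected while total active memory demand stays below physical memory.
# VM names and sizes are hypothetical.
host_physical_mb = 512 * 1024
vm_active_mb = {'hana': 320 * 1024, 'app1': 48 * 1024, 'app2': 64 * 1024}

total_demand = sum(vm_active_mb.values())
if total_demand < host_physical_mb:
    print(f'Demand {total_demand} MB < physical {host_physical_mb} MB: no host swapping expected.')
else:
    print('Active memory overcommit: ballooning, compression, or swapping may kick in.')
```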

Page 56: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

56

Avoid Pitfalls (cont’d)

vSocket != pSocket

• Use the defaults unless you have a good reason to change them

• 1 core per virtual socket

• N virtual sockets for an N-vCPU VM

• vSocket dictates vNUMA

• Understand the implications for vNUMA (see the sketch below)

• Be careful with CPUs that have two NUMA nodes per socket
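A small sketch of the guidance above: default to one core per virtual socket (N sockets for an N-vCPU VM), and watch for cores-per-socket values that do not line up with the physical NUMA node, since the virtual socket layout dictates vNUMA. This is a simplified illustration, not ESXi's placement logic.

```python
# Sketch of the guidance above: default to 1 core per virtual socket
# (N sockets for an N-vCPU VM), and be wary of cores-per-socket values that
# do not line up with the physical NUMA node, because the virtual socket
# layout dictates vNUMA. A simplified illustration, not ESXi's logic.

def check_socket_layout(num_vcpus, cores_per_vsocket, cores_per_pnode):
    if num_vcpus % cores_per_vsocket:
        return 'cores-per-socket does not divide the vCPU count evenly'
    vsockets = num_vcpus // cores_per_vsocket
    if cores_per_vsocket > cores_per_pnode:
        return (f'{vsockets} virtual socket(s), each wider than a physical '
                f'NUMA node ({cores_per_pnode} cores): poor vNUMA fit')
    return f'{vsockets} virtual socket(s) of {cores_per_vsocket} core(s): OK'

print(check_socket_layout(16, 1, 8))    # default layout: 16 sockets x 1 core
print(check_socket_layout(16, 16, 8))   # one wide socket spanning two pNUMA nodes
```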

Page 57: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

57

Agenda

Technical Overview

Monster Performance

Recommendations

Useful Resources

Extreme Performance Series Sessions

Page 58: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

58

Performance Community Resources

Performance Technology Pages

• http://www.vmware.com/technical-resources/performance/resources.html

Technical Marketing Blog

• http://blogs.vmware.com/vsphere/performance/

Performance Engineering Blog VROOM!

• http://blogs.vmware.com/performance

Performance Community Forum

• http://communities.vmware.com/community/vmtn/general/performance

Virtualizing Business Critical Applications

• http://www.vmware.com/solutions/business-critical-apps/

Page 59: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

59

Performance Technical Resources

Performance Technical Papers

• http://www.vmware.com/resources/techresources/cat/91,96

Performance Best Practices

• http://www.youtube.com/watch?v=tHL6Vu3HoSA

• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf

• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf

• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf

• http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.1.pdf

Troubleshooting Performance Related Problems in vSphere Environments

• http://communities.vmware.com/docs/DOC-14905 (vSphere 4.1)

• http://communities.vmware.com/docs/DOC-19166 (vSphere 5)

• http://communities.vmware.com/docs/DOC-23094 (vSphere 5.x with vCOps)

Page 60: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

60

Performance Technical Resources (cont'd)

Resource Management Guide

• https://www.vmware.com/support/pubs/vsphere-esxi-vcenter-server-pubs.html

The CPU Scheduler in VMware vSphere 5.1

• http://www.vmware.com/resources/techresources/10345

Understanding Memory Management in VMware vSphere 5

• http://www.vmware.com/resources/techresources/10206

Host Power Management in VMware vSphere 5.5

• http://www.vmware.com/files/pdf/techpaper/hpm-perf-vsphere55.pdf

Page 61: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

61

Don’t miss:

vCenter of the Universe – Session # VSVC5234

Monster Virtual Machines – Session # VSVC4811

Network Speed Ahead – Session # VSVC5596

Storage in a Flash – Session # VSVC5603

Big Data: Virtualized SAP HANA Performance, Scalability and Best Practices – Session # VAPP5591

Page 62: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

62

Other VMware Activities Related to This Session

HOL:

HOL-SDC-1304 – vSphere Performance Optimization

HOL-SDC-1317 – vCloud Suite Use Cases - Business Critical Applications

Group Discussions:

VSVC1001-GD – Performance with Mark Achtemichuk


Page 63: VMworld 2013: Extreme Performance Series: Monster Virtual Machines

THANK YOU
