VMworld 2015: Extreme Performance Series - vSphere Compute & Memory


TRANSCRIPT

Page 1: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Extreme Performance Series: vSphere Compute & Memory

Fei Guo, VMware, Inc.
Seong Beom Kim, VMware, Inc.

INF5701

#INF5701

Page 2: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

2

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Disclaimer

Page 3: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

vSphere CPU Management

Page 4: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Outline

• What to Expect on VM Performance

• Ready Time (%RDY)

• VM Sizing: How Many vCPUs?

• NUMA & vNUMA

Page 5: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Set the Right Expectation on VM Performance

Page 6: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

6

What Happens When Idle → Active?

[Diagram: VM, VMM (VT / AMD-V), and VMkernel layers]

• In the VMM: privileged instructions and TLB misses are handled via VT / AMD-V.
• On I/O or HLT: the VMkernel de-schedules the vCPU, moves its state to IDLE, and issues the work to I/O threads.
• When the vCPU becomes active again: its state moves to READY, it waits in the ready queue, and it runs once the scheduler picks it.

Page 7: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

7

When Your App Is Slow in a VM

• High virtualization overhead
  – A lot of privileged instructions / operations (CPUID, mov CR3, etc.)
  – A lot of TLB misses (addressing huge memory); large pages help a lot

• Resource contention
  – High ready (%RDY) time?
  – Host memory swap? (i.e., memory over-commit)

Page 8: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

8

Reasonable Expectation on VM Performance

• Best cases
  – Computation heavy, small memory footprint
  – No CPU / memory over-commit
  – ~100% of bare-metal performance

• Common cases
  – Moderate mix of compute / memory / I/O
  – Little to no CPU / memory over-commit
  – ~90% of bare-metal performance

• Worst cases
  – Huge number of TLB misses / privileged instructions
  – Heavy ESXi host memory swapping

Page 9: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

%RDY Can Happen Without CPU Contention

Page 10: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

CPU Scheduler Accounting

10

[Timeline diagram: segments A–E across times t1–t8]
  A: CPU scheduling cost
  B: time in the ready queue (%RDY)
  C: actual execution (%RUN, with %OVRLP overlapping)
  D: time interrupted for system work (%SYS += D if the work is for this VM)
  E: efficiency loss from power management, hyper-threading, etc.

%USED = %RUN + %SYS - %OVRLP - E

A worked example of this accounting appears below.
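The accounting identity above can be made concrete with a small worked example. This is an illustrative sketch only; the helper function and the percentage values are made up, not taken from a real esxtop capture.

```python
# Illustrative sketch of the CPU accounting shown above. The function
# name and the example percentages are made up for demonstration.

def used_pct(run_pct: float, sys_pct: float, ovrlp_pct: float,
             efficiency_loss_pct: float) -> float:
    """%USED = %RUN + %SYS - %OVRLP - E (efficiency loss)."""
    return run_pct + sys_pct - ovrlp_pct - efficiency_loss_pct

# A vCPU that ran 70% of the interval, was charged 5% of system work,
# overlapped 3% of that with its own run time, and lost 2% to power
# management / hyper-threading:
print(used_pct(run_pct=70.0, sys_pct=5.0, ovrlp_pct=3.0,
               efficiency_loss_pct=2.0))  # -> 70.0
```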

Page 11: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Meaning of High %RDY

11

A: scheduling cost   B: time in the ready queue   C: actual execution

• Pattern "A B C" (long time in the ready queue) points to:
  – CPU contention
  – Limit, low shares
  – Poor CPU affinity
  – Poor load balancing

• Pattern "A C A C A C ..." (e.g., frequent sleep/wakeup): %RDY is dominated by the scheduling cost (A), even without contention.

Page 12: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Troubleshooting High %RDY

• High queue time
  – Check for DRS load-balancing issues
  – Check the CPU resource specification (limit, low shares)
    • %MLMTD: percent of time in the READY state due to a CPU limit
  – Avoid using CPU affinity

• Dominant CPU scheduling cost
  – Change application behavior (avoid frequent sleep / wakeup)
  – Delay or do not yield the PCPU:
    • monitor.idleLoopSpinUS > 0: burns more CPU power; OK for consolidation
    • LatencySensitivity = HIGH: power efficient; bad for consolidation

(An example of scanning esxtop batch output for high %RDY appears below.)

12
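To spot the "high queue time" case in practice, one option is to capture esxtop in batch mode and scan the per-group CPU counters. The sketch below is a rough, hypothetical example: the column-name substrings ("Group Cpu", "% Ready") and the 10% threshold are assumptions to verify against your own capture.

```python
# Rough sketch: average the "% Ready" columns from an esxtop batch
# capture (e.g. `esxtop -b -d 5 -n 120 > cpu.csv`) and flag groups whose
# average exceeds a threshold. The column-name substrings below are an
# assumption about esxtop's CSV headers; adjust them to match your file.

import csv

def high_ready_groups(csv_path: str, threshold_pct: float = 10.0):
    sums, counts = {}, {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                if col and "Group Cpu" in col and "% Ready" in col:
                    try:
                        v = float(val)
                    except (TypeError, ValueError):
                        continue
                    sums[col] = sums.get(col, 0.0) + v
                    counts[col] = counts.get(col, 0) + 1
    return {col: sums[col] / counts[col]
            for col in sums if sums[col] / counts[col] > threshold_pct}

# for group, avg in high_ready_groups("cpu.csv").items():
#     print(f"{group}: average %RDY {avg:.1f}")
```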

Page 13: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Same %RDY, Different Performance Impact

Page 14: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

14

%RDY Impact on Throughput

[Chart: normalized throughput (bops) vs. %RDY from 0 to 20; throughput declines as %RDY increases]

• Throughput workload: Java server, CPU and memory intensive
• %RDY ≈ throughput drop

Page 15: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

15

%RDY Impact on Latency

[Chart: 99.99th-percentile latency (msec) vs. %RDY from 0 to 25, for "spiky" and "flat" %RDY patterns]

• Latency workload: in-memory key-value store, CPU and memory intensive
• %RDY can have a significant impact on tail latency
• The same %RDY can have a different impact depending on whether it is spiky or flat

Page 16: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

16

When %RDY Is Acceptable

• VMs are consolidated onto one NUMA node
  – When VMs share data (communication, same I/O context, etc.)
  – %RDY may increase
  – Still better than running slowly, without %RDY, on separate NUMA nodes

• vSphere 6.0 is less aggressive about this consolidation
  – Leaves 10% CPU headroom
  – Lower /Numa/CoreCapRatioPct to increase the headroom

Page 17: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Oversizing VM is Wasteful and Even Harmful

Page 18: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Unused VCPU Wastes CPU

18

[Chart: %USED of an idle VM by guest OS and timer rate: RHEL5 100Hz (*), RHEL5 1kHz, RHEL6 tickless (*), Win2k8 64Hz (*), Win2k8 1kHz]

• An idle vCPU still consumes CPU
• Can be significant with a 1 kHz timer (RHEL5 1kHz)
• Mostly trivial

Page 19: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Over-sizing VM Can Hurt Performance

19

• Single-threaded app
• Does not benefit from more vCPUs
• Hurt by in-guest migrations

[Chart: normalized throughput vs. VM size (1, 2, 4, 8, 16, 32, 64 vCPUs)]

Page 20: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

ESXi is Optimized for NUMA

Page 21: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

21

What is NUMA?

• Non-Uniform Memory Access system architecture
  – Each node consists of CPU cores and memory

• A VM can access memory on remote NUMA nodes, but at a performance cost
  – Access time can be 30%-200% longer

[Diagram: NUMA node 1 and NUMA node 2]

Page 22: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

ESXi Schedules the VM for Optimal NUMA Performance

[Diagram: the VM's vCPUs and memory placed together on a single NUMA node: good memory locality]

22

Page 23: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

A Wide VM Needs vNUMA

[Diagram: a wide VM spanning two NUMA nodes without vNUMA; the guest cannot see node boundaries, so much of its memory ends up remote: poor memory locality]

23

Page 24: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

24

vNUMA Achieves Optimal NUMA Performance

[Diagram: the same wide VM with vNUMA exposed; the guest places work and memory per virtual node, so accesses stay local: good memory locality]

Page 25: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Stick to vNUMA Default

Page 26: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Do Not Change coresPerSocket Without a Good Reason

• Changing the default means you are setting the vNUMA size

• If licensing requires fewer virtual sockets:
  – Find the optimal vNUMA size
  – Match coresPerSocket to the vNUMA size
  – e.g., a 20-vCPU VM on a 10-cores-per-node system:
    • Default vNUMA size = 10 vCPUs per virtual node
    • Set coresPerSocket = 10

• Enabling "CPU Hot Add" disables vNUMA

(A sizing sketch for this example appears below.)

26
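The 20-vCPU example above can be expressed as a tiny sizing helper. This is only a sketch of the rule of thumb on the slide (align virtual sockets with physical NUMA nodes); the function name is hypothetical and only the evenly divisible case is handled.

```python
# Sketch of the coresPerSocket rule from this slide: when you must set
# it, align virtual sockets with the host's NUMA node size.

def suggested_cores_per_socket(num_vcpus: int, cores_per_numa_node: int) -> int:
    if num_vcpus <= cores_per_numa_node:
        # The VM fits in one NUMA node: a single virtual socket is fine.
        return num_vcpus
    # Wide VM: one virtual socket per physical NUMA node.
    return cores_per_numa_node

# 20-vCPU VM on a host with 10 cores per NUMA node -> coresPerSocket = 10
print(suggested_cores_per_socket(20, 10))  # -> 10 (two virtual sockets)
```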

Page 27: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Key Takeaways

Page 28: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

28

Summary

• Set the right expectation on VM performance

• %RDY can happen without CPU contention
  – Watch out for frequent sleep / wakeup

• Same %RDY, different performance impact
  – More significant impact on tail latency

• Oversizing VM wastes CPU and may hurt performance

• ESXi is optimized for NUMA

• Stick to vNUMA default

• Check out the CPU scheduler white paper
  – https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

Page 29: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

vSphere Memory Management

Page 30: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Outline

• ESXi Memory Management Basics

• VM Sizing

• Reservation vs. Preallocation

• Page sharing vs. Large page

• Memory Overhead

• Memory Overcommitment Guidance

Page 31: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Terminology

• Memory size: total amount of memory
• Allocated memory vs. free memory
• Active memory: allocated memory recently accessed or used
• Idle memory: allocated memory not recently accessed

31

Page 32: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Task of the Memory Scheduler

• Compute a memory entitlement for each VM
  – Based on reservation, limit, shares, and memory demand
  – Memory demand is determined by active memory (sampling-based estimation)

• Reclaim guest memory if entitlement < consumed (see the toy model below)

32

Performance goal: handle burst memory pressure well
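As a rough mental model of the reclaim rule above (not the actual ESXi algorithm), the sketch below treats the entitlement as a given input and computes how much memory would become a reclamation candidate.

```python
# Toy model of "reclaim guest memory if entitlement < consumed". This is
# NOT the real ESXi memory scheduler; entitlement is simply taken as an
# input here rather than derived from reservation/limit/shares/demand.

def reclaim_target_mb(consumed_mb: int, entitlement_mb: int) -> int:
    """Memory (MB) that becomes a candidate for reclamation."""
    return max(0, consumed_mb - entitlement_mb)

# Example: a VM consuming 6 GB while entitled to 4 GB.
print(reclaim_target_mb(consumed_mb=6144, entitlement_mb=4096))  # -> 2048
```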

Page 33: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

33

Memory Reclamation Basics (vSphere 5.5 and earlier)

[Diagram: host memory from 0 to max, split into consumed and free; the state is HIGH while free memory stays above minFree, LOW once it drops below]

State | Page Sharing | Ballooning | Compression | Swapping
High  | X            |            |             |
Low   | X            | X          | X           | X (expensive)

Refer to http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf for details.

Page 34: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Reservation vs. Preallocation

Page 35: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

35

Different in Many Aspects

• Reservation
  – Used in admission control and entitlement calculation
  – Setting it does NOT mean memory is fully allocated
  – General protection against memory reclamation

• Preallocation
  – Memory is fully reserved AND fully allocated
  – Advanced configuration option: sched.mem.prealloc = TRUE
  – Mostly used for latency-sensitive workloads

Page 36: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

VM Sizing

Page 37: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

37

Guard Against "Active Memory" Reclamation

• VM memory size > the peak demand
• If necessary, set the reservation above guest demand (see the sketch below)
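A minimal sketch of that sizing rule, assuming you already have an observed peak-demand figure (e.g. from vRealize Operations); the 20% headroom is an illustrative value, not a VMware recommendation.

```python
# Sketch: size VM memory above observed peak demand, and set the
# reservation at least to that demand so active memory is not reclaimed.
# The 20% headroom is an assumed example value.

def suggested_vm_memory_mb(peak_demand_mb: int, headroom: float = 0.20) -> int:
    return int(peak_demand_mb * (1 + headroom))

def suggested_reservation_mb(peak_demand_mb: int) -> int:
    # A reservation covering peak demand protects active memory from
    # ballooning / swapping under host memory pressure.
    return peak_demand_mb

print(suggested_vm_memory_mb(8192))    # -> 9830 (8 GB demand + headroom)
print(suggested_reservation_mb(8192))  # -> 8192
```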

Page 38: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Page Sharing & Large Page

Page 39: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Saving from Page Sharing

• Significant for homogeneous VMs

Workload   | Guest OS   | # of VMs | Total guest memory
Swingbench | RedHat 5.6 | 12       | 48 GB
VDI        | Windows 7  | 15       | 60 GB

[Pie charts: shared vs. non-shared memory and the fraction saved by sharing. Swingbench: roughly 43% shared / 57% non-shared, ~34% saved. VDI: roughly 75% shared / 25% non-shared, ~73% saved.]

39

Page 40: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

40

What "Prevents" Sharing

• Guest features
  – ASLR (Address Space Layout Randomization)
    • Less than 50 MB sharing reduction
  – SuperFetch (proactive caching)
    • Largely reduces sharing
    • The increase in I/Os hurts VM performance

• Host features
  – Host large pages
    • ESXi does not share large pages
    • The page-sharing scanning thread still works (generates page signatures)

Page 41: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Why Large Pages?

• Fewer TLB misses
• Faster page-table lookup time
• Enabled by default

Guest Large Pages | Host Large Pages | SPECjbb  | Swingbench
√                 | √                | +30%     | +12%
×                 | √                | +12%     | +7%
√                 | ×                | +6%      | -
×                 | ×                | baseline | baseline

41

Page 42: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

42

Large Page Impact on Memory Overcommitment

• Higher memory pressure because large pages are not shared
• A large page is broken when any small page within it is ballooned or swapped
  – Sharing happens thereafter

[Chart: "Memory Overcommitment with Swingbench VMs": ballooned / swapped / shared memory (GB) and number of large pages (nrLarge) over time (minutes)]

Page 43: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

43

New in vSphere 6.0

• Adds a new memory state, "Clear", between High and Low

• Large pages are broken in the Clear state
  – Only if they contain shareable small pages
  – Avoids entering the Low state
  – Best use of both large pages and small pages

[Diagram: memory states High, Clear, and Low around the minFree threshold]

Page 44: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

44

Performance Improvement

[Charts comparing ESXi 5.5 and ESXi 6.0 on a VDI workload: average latency (seconds) vs. number of extra VMs; total ballooned + swapped memory (MB) over time; total shared memory (GB) over time]

• ESXi 6.0 (with the Clear state): sharing happens much earlier => no ballooning/swapping!
• Reference: http://dl.acm.org/citation.cfm?id=2731187

Page 45: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Overhead Memory

Page 46: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

46

Per Host & Per VM

• Composed of MANY components
  – On an idle host, the kernel overhead memory breakdown looks like this ... [kernel overhead breakdown chart]

• Too complex to capture in an accurate formula

Page 47: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

47

"Experimentally Safe" Estimation

• Per-VM overhead
  – Less than 10% of configured memory

• Host memory usage without noticeable impact
  – <= 64 GB host: 90% of host memory
  – > 64 GB host: 95% of host memory

• The numbers above are conservative! (A back-of-the-envelope check appears below.)
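The rules of thumb above translate into a quick capacity check. The sketch below simply encodes those conservative figures; the 12-VM example is made up.

```python
# Back-of-the-envelope check using the conservative figures above:
# per-VM overhead budgeted at 10% of configured memory, and host usage
# capped at 90% (<= 64 GB hosts) or 95% (> 64 GB hosts).

GB = 1024  # MB per GB

def usable_host_memory_mb(host_mb: int) -> int:
    return int(host_mb * (0.90 if host_mb <= 64 * GB else 0.95))

def vm_footprint_mb(configured_mb: int, overhead_frac: float = 0.10) -> int:
    return int(configured_mb * (1 + overhead_frac))

host_mb = 256 * GB
vm_sizes = [16 * GB] * 12                       # example: twelve 16 GB VMs
total = sum(vm_footprint_mb(s) for s in vm_sizes)
print(total <= usable_host_memory_mb(host_mb))  # -> True (fits safely)
```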

Page 48: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Overcommitment Guidance

Page 49: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Configured vs. Active Memory Overcommitment

• Two types of memory overcommitment
  – "Configured" memory overcommitment
    • SUM(memory size of all VMs) / host memory size
  – "Active" memory overcommitment
    • SUM(mem.active of all VMs) / host memory size

• Performance impact
  – "Active" memory overcommitment ≈ 1: high likelihood of performance degradation!
    • Some active memory is not in physical RAM
  – "Configured" memory overcommitment > 1: zero or negligible impact
    • Most reclaimed memory is free/idle guest memory

(The two ratios are computed in the sketch below.)

49
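The two ratios defined above are straightforward to compute. The sketch below uses made-up numbers; in practice the per-VM active figures would come from vCenter or vRealize Operations statistics.

```python
# Sketch of the two overcommitment ratios defined on this slide.
# All numbers are illustrative examples.

def configured_overcommit(vm_sizes_mb, host_mb):
    return sum(vm_sizes_mb) / host_mb

def active_overcommit(vm_active_mb, host_mb):
    return sum(vm_active_mb) / host_mb

host_mb = 128 * 1024
sizes = [32 * 1024] * 6    # six 32 GB VMs on a 128 GB host
active = [8 * 1024] * 6    # each actively using ~8 GB

print(round(configured_overcommit(sizes, host_mb), 2))  # 1.5  -> usually fine
print(round(active_overcommit(active, host_mb), 2))     # 0.38 -> below 1, safe
```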

Page 50: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

General Principles

• High consolidation
  – Keep "active memory" overcommitment < 1

• How to know the "active memory" of a VM?
  – Use vRealize Operations to track the VM's average and maximum memory demand

• What if I have no idea about active memory?
  – Monitor performance counters while adding VMs

50

Page 51: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

51

Host Statistics (Not Recommended)

• mem.consumed
  – Memory allocation varies dynamically based on entitlement
  – It does not imply a performance problem

• Reclamation-related counters
  – mem.balloon
  – mem.swapUsed
  – mem.compressed
  – mem.shared

Nonzero values do NOT necessarily mean a performance problem!

Page 52: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Example One (Transient Memory Pressure)

• Six 4 GB Swingbench VMs (VM-4, 5, 6 are idle) on a 16 GB host

[Charts: operations per minute over time for VM1-VM3, and ballooned / swapped / compressed / shared memory size (GB) over time]

• Throughput impact: ΔVM1 = 0%, ΔVM2 = 0%

52

Page 53: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

53

Which Statistics to Watch?

• mem.swapInRate
  – A constant nonzero value indicates a performance problem

• mem.latency
  – % of time spent waiting for decompression and swap-in
  – Estimates the performance impact of compression and swapping

• mem.active
  – If active memory is low, reclaimed memory is less likely to be a problem

(A simple check for a "constant nonzero" swap-in rate is sketched below.)
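One way to distinguish a constant nonzero swap-in rate from a transient burst is to look at how many samples in a window are nonzero. The sketch below is illustrative: the sample values and the 80%-of-samples rule are assumptions, not ESXi thresholds.

```python
# Sketch: flag a sustained nonzero mem.swapInRate (KB/s) across samples.
# The 80%-of-samples rule and the sample data are illustrative assumptions.

def sustained_swap_in(samples_kbps, nonzero_fraction: float = 0.8) -> bool:
    if not samples_kbps:
        return False
    nonzero = sum(1 for s in samples_kbps if s > 0)
    return nonzero / len(samples_kbps) >= nonzero_fraction

transient = [0, 0, 1200, 800, 0, 0, 0, 0]       # brief burst: likely harmless
constant = [900, 1100, 1000, 950, 1200, 1050]   # sustained: investigate

print(sustained_swap_in(transient))  # False
print(sustained_swap_in(constant))   # True
```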

Page 54: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Example Two (Constant Memory Pressure)

• All six VMs run Swingbench workloads

[Charts: operations per minute over time for VM1-VM6, and swap-in rate (KB per second) over time]

• Throughput impact: ΔVM1 = -16%, ΔVM2 = -21%

54

Page 55: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Key Takeaways

Page 56: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

56

Summary

• Track mem.{swapInRate, active, latency} for performance issues.

• VM memory should be sized based on memory demand.

• “Single digit” memory overhead.

• New ESXi memory management feature improves performance.

• ESXi is expected to handle transient memory pressure well.

Page 57: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
Page 58: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
Page 59: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Extreme Performance Series: vSphere Compute & Memory

Fei Guo, VMware, Inc.
Seong Beom Kim, VMware, Inc.

INF5701

#INF5701