VMworld 2015: Extreme Performance Series - vSphere Compute & Memory


TRANSCRIPT

Page 1: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Extreme Performance Series: vSphere Compute & Memory

Fei Guo, VMware, Inc.
Seong Beom Kim, VMware, Inc.

INF5701

#INF5701

Page 2: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

2

• This presentation may contain product features that are currently under development.

• This overview of new technology represents no commitment from VMware to deliver these features in any generally available product.

• Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

• Technical feasibility and market demand will affect final delivery.

• Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Disclaimer

Page 3: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

vSphere CPU Management

Page 4: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Outline

• What to Expect on VM Performance

• Ready Time (%RDY)

• VM Sizing: How Many vCPUs?

• NUMA & vNUMA

Page 5: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Set the Right Expectation on VM Performance

Page 6: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

6

What Happens When Idle → Active?

[Diagram: VM, VMM (VT / AMD-V), and VMkernel layers]

• In the VMM: privileged instructions and TLB misses are handled via VT / AMD-V.
• On I/O or HLT: the VMkernel de-schedules the vCPU, moves its state to IDLE, and issues the work to I/O threads.
• When the vCPU becomes active again: its state moves to READY, it waits in the ready queue, and it runs once the scheduler picks it.

Page 7: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

7

When Your App Is Slow in a VM

• High virtualization overhead
  – A lot of privileged instructions / operations (CPUID, mov CR3, etc.)
  – A lot of TLB misses (addressing huge memory); large pages help a lot

• Resource contention
  – High ready (%RDY) time?
  – Host memory swap? (i.e., memory over-commit)

Page 8: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

8

Reasonable Expectation on VM Performance

• Best cases
  – Computation heavy, small memory footprint
  – No CPU / memory over-commit
  – ~100% of bare-metal performance

• Common cases
  – Moderate mix of compute / memory / I/O
  – Little to no CPU / memory over-commit
  – ~90% of bare-metal performance

• Worst cases
  – Huge number of TLB misses / privileged instructions
  – Heavy ESXi host memory swapping

Page 9: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

%RDY Can Happen Without CPU Contention

Page 10: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

CPU Scheduler Accounting

10

[Timeline diagram: segments A–E across times t1–t8]
  A: CPU scheduling cost
  B: time in the ready queue (%RDY)
  C: actual execution (%RUN, with %OVRLP overlapping)
  D: time interrupted for system work (%SYS += D if the work is for this VM)
  E: efficiency loss from power management, hyper-threading, etc.

%USED = %RUN + %SYS - %OVRLP - E

A worked example of this accounting appears below.
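The accounting identity above can be made concrete with a small worked example. This is an illustrative sketch only; the helper function and the percentage values are made up, not taken from a real esxtop capture.

```python
# Illustrative sketch of the CPU accounting shown above. The function
# name and the example percentages are made up for demonstration.

def used_pct(run_pct: float, sys_pct: float, ovrlp_pct: float,
             efficiency_loss_pct: float) -> float:
    """%USED = %RUN + %SYS - %OVRLP - E (efficiency loss)."""
    return run_pct + sys_pct - ovrlp_pct - efficiency_loss_pct

# A vCPU that ran 70% of the interval, was charged 5% of system work,
# overlapped 3% of that with its own run time, and lost 2% to power
# management / hyper-threading:
print(used_pct(run_pct=70.0, sys_pct=5.0, ovrlp_pct=3.0,
               efficiency_loss_pct=2.0))  # -> 70.0
```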

Page 11: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Meaning of High %RDY

11

A: scheduling cost   B: time in the ready queue   C: actual execution

• Pattern "A B C" (long time in the ready queue) points to:
  – CPU contention
  – Limit, low shares
  – Poor CPU affinity
  – Poor load balancing

• Pattern "A C A C A C ..." (e.g., frequent sleep/wakeup): %RDY is dominated by the scheduling cost (A), even without contention.

Page 12: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Troubleshooting High %RDY

• High queue time
  – Check for DRS load-balancing issues
  – Check the CPU resource specification (limit, low shares)
    • %MLMTD: percent of time in the READY state due to a CPU limit
  – Avoid using CPU affinity

• Dominant CPU scheduling cost
  – Change application behavior (avoid frequent sleep / wakeup)
  – Delay or do not yield the PCPU:
    • monitor.idleLoopSpinUS > 0: burns more CPU power; OK for consolidation
    • LatencySensitivity = HIGH: power efficient; bad for consolidation

(An example of scanning esxtop batch output for high %RDY appears below.)

12
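To spot the "high queue time" case in practice, one option is to capture esxtop in batch mode and scan the per-group CPU counters. The sketch below is a rough, hypothetical example: the column-name substrings ("Group Cpu", "% Ready") and the 10% threshold are assumptions to verify against your own capture.

```python
# Rough sketch: average the "% Ready" columns from an esxtop batch
# capture (e.g. `esxtop -b -d 5 -n 120 > cpu.csv`) and flag groups whose
# average exceeds a threshold. The column-name substrings below are an
# assumption about esxtop's CSV headers; adjust them to match your file.

import csv

def high_ready_groups(csv_path: str, threshold_pct: float = 10.0):
    sums, counts = {}, {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                if col and "Group Cpu" in col and "% Ready" in col:
                    try:
                        v = float(val)
                    except (TypeError, ValueError):
                        continue
                    sums[col] = sums.get(col, 0.0) + v
                    counts[col] = counts.get(col, 0) + 1
    return {col: sums[col] / counts[col]
            for col in sums if sums[col] / counts[col] > threshold_pct}

# for group, avg in high_ready_groups("cpu.csv").items():
#     print(f"{group}: average %RDY {avg:.1f}")
```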

Page 13: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Same %RDY, Different Performance Impact

Page 14: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

14

%RDY Impact on Throughput

[Chart: normalized throughput (bops) vs. %RDY from 0 to 20; throughput declines as %RDY increases]

• Throughput workload: Java server, CPU and memory intensive
• %RDY ≈ throughput drop

Page 15: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

15

%RDY Impact on Latency

[Chart: 99.99th-percentile latency (msec) vs. %RDY from 0 to 25, for "spiky" and "flat" %RDY patterns]

• Latency workload: in-memory key-value store, CPU and memory intensive
• %RDY can have a significant impact on tail latency
• The same %RDY can have a different impact depending on whether it is spiky or flat

Page 16: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

16

When %RDY Is Acceptable

• VMs are consolidated onto one NUMA node
  – When VMs share data (communication, same I/O context, etc.)
  – %RDY may increase
  – Still better than running slowly, without %RDY, on separate NUMA nodes

• vSphere 6.0 is less aggressive about this consolidation
  – Leaves 10% CPU headroom
  – Lower /Numa/CoreCapRatioPct to increase the headroom

Page 17: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Oversizing VM is Wasteful and Even Harmful

Page 18: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Unused VCPU Wastes CPU

18

[Chart: %USED of an idle VM by guest OS and timer rate: RHEL5 100Hz (*), RHEL5 1kHz, RHEL6 tickless (*), Win2k8 64Hz (*), Win2k8 1kHz]

• An idle vCPU still consumes CPU
• Can be significant with a 1 kHz timer (RHEL5 1kHz)
• Mostly trivial

Page 19: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Over-sizing VM Can Hurt Performance

19

• Single-threaded app
• Does not benefit from more vCPUs
• Hurt by in-guest migrations

[Chart: normalized throughput vs. VM size (1, 2, 4, 8, 16, 32, 64 vCPUs)]

Page 20: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

ESXi is Optimized for NUMA

Page 21: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

21

What is NUMA?

• Non-Uniform Memory Access system architecture
  – Each node consists of CPU cores and memory

• A VM can access memory on remote NUMA nodes, but at a performance cost
  – Access time can be 30%-200% longer

[Diagram: NUMA node 1 and NUMA node 2]

Page 22: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

ESXi Schedules the VM for Optimal NUMA Performance

[Diagram: the VM's vCPUs and memory placed together on a single NUMA node: good memory locality]

22

Page 23: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

A Wide VM Needs vNUMA

[Diagram: a wide VM spanning two NUMA nodes without vNUMA; the guest cannot see node boundaries, so much of its memory ends up remote: poor memory locality]

23

Page 24: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

24

vNUMA Achieves Optimal NUMA Performance

[Diagram: the same wide VM with vNUMA exposed; the guest places work and memory per virtual node, so accesses stay local: good memory locality]

Page 25: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Stick to vNUMA Default

Page 26: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Do Not Change coresPerSocket Without a Good Reason

• Changing the default means you are setting the vNUMA size

• If licensing requires fewer virtual sockets:
  – Find the optimal vNUMA size
  – Match coresPerSocket to the vNUMA size
  – e.g., a 20-vCPU VM on a 10-cores-per-node system:
    • Default vNUMA size = 10 vCPUs per virtual node
    • Set coresPerSocket = 10

• Enabling "CPU Hot Add" disables vNUMA

(A sizing sketch for this example appears below.)

26
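The 20-vCPU example above can be expressed as a tiny sizing helper. This is only a sketch of the rule of thumb on the slide (align virtual sockets with physical NUMA nodes); the function name is hypothetical and only the evenly divisible case is handled.

```python
# Sketch of the coresPerSocket rule from this slide: when you must set
# it, align virtual sockets with the host's NUMA node size.

def suggested_cores_per_socket(num_vcpus: int, cores_per_numa_node: int) -> int:
    if num_vcpus <= cores_per_numa_node:
        # The VM fits in one NUMA node: a single virtual socket is fine.
        return num_vcpus
    # Wide VM: one virtual socket per physical NUMA node.
    return cores_per_numa_node

# 20-vCPU VM on a host with 10 cores per NUMA node -> coresPerSocket = 10
print(suggested_cores_per_socket(20, 10))  # -> 10 (two virtual sockets)
```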

Page 27: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Key Takeaways

Page 28: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

28

Summary

• Set the right expectation on VM performance

• %RDY can happen without CPU contention
  – Watch out for frequent sleep / wakeup

• Same %RDY, different performance impact
  – More significant impact on tail latency

• Oversizing VM wastes CPU and may hurt performance

• ESXi is optimized for NUMA

• Stick to vNUMA default

• Check out the CPU scheduler white paper
  – https://www.vmware.com/files/pdf/techpaper/VMware-vSphere-CPU-Sched-Perf.pdf

Page 29: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

vSphere Memory Management

Page 30: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Outline

• ESXi Memory Management Basics

• VM Sizing

• Reservation vs. Preallocation

• Page sharing vs. Large page

• Memory Overhead

• Memory Overcommitment Guidance

Page 31: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Terminology

• Memory size: total amount of memory
• Allocated memory vs. free memory
• Active memory: allocated memory recently accessed or used
• Idle memory: allocated memory not recently accessed

31

Page 32: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Task of the Memory Scheduler

• Compute a memory entitlement for each VM
  – Based on reservation, limit, shares, and memory demand
  – Memory demand is determined by active memory (sampling-based estimation)

• Reclaim guest memory if entitlement < consumed (see the toy model below)

32

Performance goal: handle burst memory pressure well
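As a rough mental model of the reclaim rule above (not the actual ESXi algorithm), the sketch below treats the entitlement as a given input and computes how much memory would become a reclamation candidate.

```python
# Toy model of "reclaim guest memory if entitlement < consumed". This is
# NOT the real ESXi memory scheduler; entitlement is simply taken as an
# input here rather than derived from reservation/limit/shares/demand.

def reclaim_target_mb(consumed_mb: int, entitlement_mb: int) -> int:
    """Memory (MB) that becomes a candidate for reclamation."""
    return max(0, consumed_mb - entitlement_mb)

# Example: a VM consuming 6 GB while entitled to 4 GB.
print(reclaim_target_mb(consumed_mb=6144, entitlement_mb=4096))  # -> 2048
```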

Page 33: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

33

Memory Reclamation Basics (vSphere 5.5 and earlier)

[Diagram: host memory from 0 to max, split into consumed and free; the state is HIGH while free memory stays above minFree, LOW once it drops below]

State | Page Sharing | Ballooning | Compression | Swapping
High  | X            |            |             |
Low   | X            | X          | X           | X (expensive)

Refer to http://www.vmware.com/files/pdf/mem_mgmt_perf_vsphere5.pdf for details.

Page 34: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Reservation vs. Preallocation

Page 35: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

35

Different in Many Aspects

• Reservation
  – Used in admission control and entitlement calculation
  – Setting it does NOT mean memory is fully allocated
  – General protection against memory reclamation

• Preallocation
  – Memory is fully reserved AND fully allocated
  – Advanced configuration option: sched.mem.prealloc = TRUE
  – Mostly used for latency-sensitive workloads

Page 36: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

VM Sizing

Page 37: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

37

Guard Against "Active Memory" Reclamation

• VM memory size > the peak demand
• If necessary, set the reservation above guest demand (see the sketch below)
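A minimal sketch of that sizing rule, assuming you already have an observed peak-demand figure (e.g. from vRealize Operations); the 20% headroom is an illustrative value, not a VMware recommendation.

```python
# Sketch: size VM memory above observed peak demand, and set the
# reservation at least to that demand so active memory is not reclaimed.
# The 20% headroom is an assumed example value.

def suggested_vm_memory_mb(peak_demand_mb: int, headroom: float = 0.20) -> int:
    return int(peak_demand_mb * (1 + headroom))

def suggested_reservation_mb(peak_demand_mb: int) -> int:
    # A reservation covering peak demand protects active memory from
    # ballooning / swapping under host memory pressure.
    return peak_demand_mb

print(suggested_vm_memory_mb(8192))    # -> 9830 (8 GB demand + headroom)
print(suggested_reservation_mb(8192))  # -> 8192
```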

Page 38: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Page Sharing & Large Page

Page 39: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Saving from Page Sharing

• Significant for homogeneous VMs

Workload   | Guest OS   | # of VMs | Total guest memory
Swingbench | RedHat 5.6 | 12       | 48 GB
VDI        | Windows 7  | 15       | 60 GB

[Pie charts: shared vs. non-shared memory and the fraction saved by sharing. Swingbench: roughly 43% shared / 57% non-shared, ~34% saved. VDI: roughly 75% shared / 25% non-shared, ~73% saved.]

39

Page 40: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

40

What "Prevents" Sharing

• Guest features
  – ASLR (Address Space Layout Randomization)
    • Less than 50 MB sharing reduction
  – SuperFetch (proactive caching)
    • Largely reduces sharing
    • The increase in I/Os hurts VM performance

• Host features
  – Host large pages
    • ESXi does not share large pages
    • The page-sharing scanning thread still works (generates page signatures)

Page 41: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Why Large Pages?

• Fewer TLB misses
• Faster page-table lookup time
• Enabled by default

Guest Large Pages | Host Large Pages | SPECjbb  | Swingbench
√                 | √                | +30%     | +12%
×                 | √                | +12%     | +7%
√                 | ×                | +6%      | -
×                 | ×                | baseline | baseline

41

Page 42: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

42

Large Page Impact on Memory Overcommitment

• Higher memory pressure because large pages are not shared
• A large page is broken when any small page within it is ballooned or swapped
  – Sharing happens thereafter

[Chart: "Memory Overcommitment with Swingbench VMs": ballooned / swapped / shared memory (GB) and number of large pages (nrLarge) over time (minutes)]

Page 43: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

43

New in vSphere 6.0

• Adds a new memory state, "Clear", between High and Low

• Large pages are broken in the Clear state
  – Only if they contain shareable small pages
  – Avoids entering the Low state
  – Best use of both large pages and small pages

[Diagram: memory states High, Clear, and Low around the minFree threshold]

Page 44: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

44

Performance Improvement

[Charts comparing ESXi 5.5 and ESXi 6.0 on a VDI workload: average latency (seconds) vs. number of extra VMs; total ballooned + swapped memory (MB) over time; total shared memory (GB) over time]

• ESXi 6.0 (with the Clear state): sharing happens much earlier => no ballooning/swapping!
• Reference: http://dl.acm.org/citation.cfm?id=2731187

Page 45: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Overhead Memory

Page 46: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

46

Per Host & Per VM

• Composed of MANY components
  – On an idle host, the kernel overhead memory breakdown looks like this ... [kernel overhead breakdown chart]

• Too complex to capture in an accurate formula

Page 47: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

47

"Experimentally Safe" Estimation

• Per-VM overhead
  – Less than 10% of configured memory

• Host memory usage without noticeable impact
  – <= 64 GB host: 90% of host memory
  – > 64 GB host: 95% of host memory

• The numbers above are conservative! (A back-of-the-envelope check appears below.)
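The rules of thumb above translate into a quick capacity check. The sketch below simply encodes those conservative figures; the 12-VM example is made up.

```python
# Back-of-the-envelope check using the conservative figures above:
# per-VM overhead budgeted at 10% of configured memory, and host usage
# capped at 90% (<= 64 GB hosts) or 95% (> 64 GB hosts).

GB = 1024  # MB per GB

def usable_host_memory_mb(host_mb: int) -> int:
    return int(host_mb * (0.90 if host_mb <= 64 * GB else 0.95))

def vm_footprint_mb(configured_mb: int, overhead_frac: float = 0.10) -> int:
    return int(configured_mb * (1 + overhead_frac))

host_mb = 256 * GB
vm_sizes = [16 * GB] * 12                       # example: twelve 16 GB VMs
total = sum(vm_footprint_mb(s) for s in vm_sizes)
print(total <= usable_host_memory_mb(host_mb))  # -> True (fits safely)
```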

Page 48: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Memory Overcommitment Guidance

Page 49: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Configured vs. Active Memory Overcommitment

• Two types of memory overcommitment
  – "Configured" memory overcommitment
    • SUM(memory size of all VMs) / host memory size
  – "Active" memory overcommitment
    • SUM(mem.active of all VMs) / host memory size

• Performance impact
  – "Active" memory overcommitment ≈ 1: high likelihood of performance degradation!
    • Some active memory is not in physical RAM
  – "Configured" memory overcommitment > 1: zero or negligible impact
    • Most reclaimed memory is free/idle guest memory

(The two ratios are computed in the sketch below.)

49
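The two ratios defined above are straightforward to compute. The sketch below uses made-up numbers; in practice the per-VM active figures would come from vCenter or vRealize Operations statistics.

```python
# Sketch of the two overcommitment ratios defined on this slide.
# All numbers are illustrative examples.

def configured_overcommit(vm_sizes_mb, host_mb):
    return sum(vm_sizes_mb) / host_mb

def active_overcommit(vm_active_mb, host_mb):
    return sum(vm_active_mb) / host_mb

host_mb = 128 * 1024
sizes = [32 * 1024] * 6    # six 32 GB VMs on a 128 GB host
active = [8 * 1024] * 6    # each actively using ~8 GB

print(round(configured_overcommit(sizes, host_mb), 2))  # 1.5  -> usually fine
print(round(active_overcommit(active, host_mb), 2))     # 0.38 -> below 1, safe
```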

Page 50: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

General Principles

• High consolidation
  – Keep "active memory" overcommitment < 1

• How to know the "active memory" of a VM?
  – Use vRealize Operations to track the VM's average and maximum memory demand

• What if I have no idea about active memory?
  – Monitor performance counters while adding VMs

50

Page 51: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

51

Host Statistics (Not Recommended)

• mem.consumed
  – Memory allocation varies dynamically based on entitlement
  – It does not imply a performance problem

• Reclamation-related counters
  – mem.balloon
  – mem.swapUsed
  – mem.compressed
  – mem.shared

Nonzero values do NOT necessarily mean a performance problem!

Page 52: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Example One (Transient Memory Pressure)

• Six 4 GB Swingbench VMs (VM-4, 5, 6 are idle) on a 16 GB host

[Charts: operations per minute over time for VM1-VM3, and ballooned / swapped / compressed / shared memory size (GB) over time]

• Throughput impact: ΔVM1 = 0%, ΔVM2 = 0%

52

Page 53: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

53

Which Statistics to Watch?

• mem.swapInRate
  – A constant nonzero value indicates a performance problem

• mem.latency
  – % of time spent waiting for decompression and swap-in
  – Estimates the performance impact of compression and swapping

• mem.active
  – If active memory is low, reclaimed memory is less likely to be a problem

(A simple check for a "constant nonzero" swap-in rate is sketched below.)
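One way to distinguish a constant nonzero swap-in rate from a transient burst is to look at how many samples in a window are nonzero. The sketch below is illustrative: the sample values and the 80%-of-samples rule are assumptions, not ESXi thresholds.

```python
# Sketch: flag a sustained nonzero mem.swapInRate (KB/s) across samples.
# The 80%-of-samples rule and the sample data are illustrative assumptions.

def sustained_swap_in(samples_kbps, nonzero_fraction: float = 0.8) -> bool:
    if not samples_kbps:
        return False
    nonzero = sum(1 for s in samples_kbps if s > 0)
    return nonzero / len(samples_kbps) >= nonzero_fraction

transient = [0, 0, 1200, 800, 0, 0, 0, 0]       # brief burst: likely harmless
constant = [900, 1100, 1000, 950, 1200, 1050]   # sustained: investigate

print(sustained_swap_in(transient))  # False
print(sustained_swap_in(constant))   # True
```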

Page 54: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Example Two (Constant Memory Pressure)

• All six VMs run Swingbench workloads

[Charts: operations per minute over time for VM1-VM6, and swap-in rate (KB per second) over time]

• Throughput impact: ΔVM1 = -16%, ΔVM2 = -21%

54

Page 55: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Key Takeaways

Page 56: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

56

Summary

• Track mem.{swapInRate, active, latency} for performance issues.

• VM memory should be sized based on memory demand.

• “Single digit” memory overhead.

• New ESXi memory management feature improves performance.

• ESXi is expected to handle transient memory pressure well.

Page 57: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
Page 58: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory
Page 59: VMworld 2015: Extreme Performance Series - vSphere Compute & Memory

Extreme Performance Series: vSphere Compute & Memory

Fei Guo, VMware, Inc.
Seong Beom Kim, VMware, Inc.

INF5701

#INF5701